"Visit with Us" is a tourism company, and its Policy Maker wants to establish a viable business model to expand the customer base. A viable business model is a central concept that helps in understanding the existing ways of doing business and how to change them for the benefit of the tourism sector. One way to expand the customer base is to introduce a new offering of packages.
Currently, the company offers five types of packages: Basic, Standard, Deluxe, Super Deluxe, and King. Looking at last year's data, we observed that 18% of the customers purchased a package. However, it was difficult to identify potential customers because customers were contacted at random, without using the available information.
The company is now planning to launch a new product, the Wellness Tourism Package. Wellness Tourism is defined as travel that allows the traveler to maintain, enhance, or kick-start a healthy lifestyle, and support or increase one's sense of well-being. This time the company wants to harness the available data on existing and potential customers to target the right customers.
The objective is to analyze the customer data to provide recommendations to the Policy Maker and to build a model that predicts which customers are likely to purchase the newly introduced travel package. The model will be used to make predictions before a customer is contacted.
The following are the key questions to be answered:
The records contain the customers' personal information and their travel details and patterns. They also contain customer interaction data captured during the sales pitch and the learnings from those sales discussions.
The detailed data dictionary is given below:
Customer Details
Customer Interaction Data
# Importing the Python Libraries to help with reading and manipulating data
import numpy as np
import pandas as pd
from IPython.display import Image
# Importing libraries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib.ticker import PercentFormatter
# To suppress warnings generated while running the code
import warnings
warnings.filterwarnings("ignore")
# this will help in making the Python code more structured automatically (good coding practice)
!pip install nb-black
%reload_ext nb_black
# Magic command to display the plots inline in the notebook
%matplotlib inline
# let's start by installing plotly
!pip install plotly
# importing plotly
import plotly.express as px
# Command to hide the 'already satisfied' messages from the install output
%pip install keras | grep -v 'already satisfied'
# Constant for making bold text
boldText = "\033[1m"
# Removes the limit for the number of displayed columns
pd.set_option("display.max_columns", None)
# Sets the limit for the number of displayed rows
pd.set_option("display.max_rows", 500)
# to split the data into train and test
from sklearn.model_selection import train_test_split
# to build linear regression model
from sklearn.linear_model import LinearRegression
# to build Bagging model
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
# to build Boosting model
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from xgboost import XGBClassifier
from sklearn.ensemble import StackingClassifier
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
# to check model performance
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
pd.set_option("mode.chained_assignment", None)
# To build model for prediction
from sklearn.linear_model import LogisticRegression
# To get different metric scores
# To tune different models
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
from sklearn import metrics
from sklearn.metrics import (
f1_score,
accuracy_score,
recall_score,
precision_score,
confusion_matrix,
roc_auc_score,
plot_confusion_matrix,
precision_recall_curve,
roc_curve,
make_scorer,
)
# Loading the Tourism dataset
xls = pd.ExcelFile("../Dataset/Tourism.xlsx")
df_dict = pd.read_excel(xls, "Data Dict")
df = pd.read_excel(xls, "Tourism")
# same random results every time
np.random.seed(1)
df.sample(n=10)
# To copy the data to another object
custData = df.copy()
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 4888 entries, 0 to 4887 Data columns (total 20 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 CustomerID 4888 non-null int64 1 ProdTaken 4888 non-null int64 2 Age 4662 non-null float64 3 TypeofContact 4863 non-null object 4 CityTier 4888 non-null int64 5 DurationOfPitch 4637 non-null float64 6 Occupation 4888 non-null object 7 Gender 4888 non-null object 8 NumberOfPersonVisiting 4888 non-null int64 9 NumberOfFollowups 4843 non-null float64 10 ProductPitched 4888 non-null object 11 PreferredPropertyStar 4862 non-null float64 12 MaritalStatus 4888 non-null object 13 NumberOfTrips 4748 non-null float64 14 Passport 4888 non-null int64 15 PitchSatisfactionScore 4888 non-null int64 16 OwnCar 4888 non-null int64 17 NumberOfChildrenVisiting 4822 non-null float64 18 Designation 4888 non-null object 19 MonthlyIncome 4655 non-null float64 dtypes: float64(7), int64(7), object(6) memory usage: 763.9+ KB
# Printing the dimensions of the dataset
print(
f"- There are {df.shape[0]} row samples and {df.shape[1]} attributes of the customer information collected in this dataset."
)
- There are 4888 row samples and 20 attributes of the customer information collected in this dataset.
df.head(5)  # Displaying the first 5 rows of the dataset
| CustomerID | ProdTaken | Age | TypeofContact | CityTier | DurationOfPitch | Occupation | Gender | NumberOfPersonVisiting | NumberOfFollowups | ProductPitched | PreferredPropertyStar | MaritalStatus | NumberOfTrips | Passport | PitchSatisfactionScore | OwnCar | NumberOfChildrenVisiting | Designation | MonthlyIncome | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 200000 | 1 | 41.0 | Self Enquiry | 3 | 6.0 | Salaried | Female | 3 | 3.0 | Deluxe | 3.0 | Single | 1.0 | 1 | 2 | 1 | 0.0 | Manager | 20993.0 |
| 1 | 200001 | 0 | 49.0 | Company Invited | 1 | 14.0 | Salaried | Male | 3 | 4.0 | Deluxe | 4.0 | Divorced | 2.0 | 0 | 3 | 1 | 2.0 | Manager | 20130.0 |
| 2 | 200002 | 1 | 37.0 | Self Enquiry | 1 | 8.0 | Free Lancer | Male | 3 | 4.0 | Basic | 3.0 | Single | 7.0 | 1 | 3 | 0 | 0.0 | Executive | 17090.0 |
| 3 | 200003 | 0 | 33.0 | Company Invited | 1 | 9.0 | Salaried | Female | 2 | 3.0 | Basic | 3.0 | Divorced | 2.0 | 1 | 5 | 1 | 1.0 | Executive | 17909.0 |
| 4 | 200004 | 0 | NaN | Self Enquiry | 1 | 8.0 | Small Business | Male | 2 | 3.0 | Basic | 4.0 | Divorced | 1.0 | 0 | 5 | 1 | 0.0 | Executive | 18468.0 |
df.tail(5)  # Displaying the last 5 rows of the dataset
| CustomerID | ProdTaken | Age | TypeofContact | CityTier | DurationOfPitch | Occupation | Gender | NumberOfPersonVisiting | NumberOfFollowups | ProductPitched | PreferredPropertyStar | MaritalStatus | NumberOfTrips | Passport | PitchSatisfactionScore | OwnCar | NumberOfChildrenVisiting | Designation | MonthlyIncome | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4883 | 204883 | 1 | 49.0 | Self Enquiry | 3 | 9.0 | Small Business | Male | 3 | 5.0 | Deluxe | 4.0 | Unmarried | 2.0 | 1 | 1 | 1 | 1.0 | Manager | 26576.0 |
| 4884 | 204884 | 1 | 28.0 | Company Invited | 1 | 31.0 | Salaried | Male | 4 | 5.0 | Basic | 3.0 | Single | 3.0 | 1 | 3 | 1 | 2.0 | Executive | 21212.0 |
| 4885 | 204885 | 1 | 52.0 | Self Enquiry | 3 | 17.0 | Salaried | Female | 4 | 4.0 | Standard | 4.0 | Married | 7.0 | 0 | 1 | 1 | 3.0 | Senior Manager | 31820.0 |
| 4886 | 204886 | 1 | 19.0 | Self Enquiry | 3 | 16.0 | Small Business | Male | 3 | 4.0 | Basic | 3.0 | Single | 3.0 | 0 | 5 | 0 | 2.0 | Executive | 20289.0 |
| 4887 | 204887 | 1 | 36.0 | Self Enquiry | 1 | 14.0 | Salaried | Male | 4 | 4.0 | Basic | 4.0 | Unmarried | 3.0 | 1 | 3 | 1 | 2.0 | Executive | 24041.0 |
df.describe(include="all").T
| count | unique | top | freq | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| CustomerID | 4888.0 | NaN | NaN | NaN | 202443.5 | 1411.188388 | 200000.0 | 201221.75 | 202443.5 | 203665.25 | 204887.0 |
| ProdTaken | 4888.0 | NaN | NaN | NaN | 0.188216 | 0.390925 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| Age | 4662.0 | NaN | NaN | NaN | 37.622265 | 9.316387 | 18.0 | 31.0 | 36.0 | 44.0 | 61.0 |
| TypeofContact | 4863 | 2 | Self Enquiry | 3444 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| CityTier | 4888.0 | NaN | NaN | NaN | 1.654255 | 0.916583 | 1.0 | 1.0 | 1.0 | 3.0 | 3.0 |
| DurationOfPitch | 4637.0 | NaN | NaN | NaN | 15.490835 | 8.519643 | 5.0 | 9.0 | 13.0 | 20.0 | 127.0 |
| Occupation | 4888 | 4 | Salaried | 2368 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Gender | 4888 | 3 | Male | 2916 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| NumberOfPersonVisiting | 4888.0 | NaN | NaN | NaN | 2.905074 | 0.724891 | 1.0 | 2.0 | 3.0 | 3.0 | 5.0 |
| NumberOfFollowups | 4843.0 | NaN | NaN | NaN | 3.708445 | 1.002509 | 1.0 | 3.0 | 4.0 | 4.0 | 6.0 |
| ProductPitched | 4888 | 5 | Basic | 1842 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| PreferredPropertyStar | 4862.0 | NaN | NaN | NaN | 3.581037 | 0.798009 | 3.0 | 3.0 | 3.0 | 4.0 | 5.0 |
| MaritalStatus | 4888 | 4 | Married | 2340 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| NumberOfTrips | 4748.0 | NaN | NaN | NaN | 3.236521 | 1.849019 | 1.0 | 2.0 | 3.0 | 4.0 | 22.0 |
| Passport | 4888.0 | NaN | NaN | NaN | 0.290917 | 0.454232 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
| PitchSatisfactionScore | 4888.0 | NaN | NaN | NaN | 3.078151 | 1.365792 | 1.0 | 2.0 | 3.0 | 4.0 | 5.0 |
| OwnCar | 4888.0 | NaN | NaN | NaN | 0.620295 | 0.485363 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 |
| NumberOfChildrenVisiting | 4822.0 | NaN | NaN | NaN | 1.187267 | 0.857861 | 0.0 | 1.0 | 1.0 | 2.0 | 3.0 |
| Designation | 4888 | 5 | Executive | 1842 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| MonthlyIncome | 4655.0 | NaN | NaN | NaN | 23619.853491 | 5380.698361 | 1000.0 | 20346.0 | 22347.0 | 25571.0 | 98678.0 |
# creating histograms
df.hist(figsize=(14, 14))
plt.show()
Data Description:
# Dropping the 'CustomerID' column since it is a unique identifier and not useful for modeling
df.drop(["CustomerID"], axis=1, inplace=True)
print("Dropped the 'CustomerID' attribute since it is not required")
Dropped the 'CustomerID' attribute since it is not required
# Checking for duplicated rows in the dataset
duplicateSum = df.duplicated().sum()
if duplicateSum > 0:
    print(f"- There are {duplicateSum} duplicated row(s) in the dataset")
    # Removing the duplicated rows from the dataset
    df.drop_duplicates(inplace=True)
    print(
        f"- There are {df.duplicated().sum()} duplicated row(s) in the dataset post cleaning"
    )
    # Resetting the index of the data frame since some rows were removed
    df.reset_index(drop=True, inplace=True)
else:
    print("- There are no duplicated row(s) in the dataset")
- There are 141 duplicated row(s) in the dataset
- There are 0 duplicated row(s) in the dataset post cleaning
df.isnull().sum()
ProdTaken 0 Age 216 TypeofContact 25 CityTier 0 DurationOfPitch 246 Occupation 0 Gender 0 NumberOfPersonVisiting 0 NumberOfFollowups 44 ProductPitched 0 PreferredPropertyStar 26 MaritalStatus 0 NumberOfTrips 138 Passport 0 PitchSatisfactionScore 0 OwnCar 0 NumberOfChildrenVisiting 60 Designation 0 MonthlyIncome 224 dtype: int64
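The raw counts above can be hard to compare across columns of different sizes. As a sketch (run here on a toy frame rather than the project dataset), the same information can be expressed as a percentage of missing values per column:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for df; in the notebook, call this on the loaded dataset
toy = pd.DataFrame(
    {"Age": [41.0, np.nan, 37.0, np.nan], "Gender": ["M", "F", "M", "F"]}
)

# Share of missing values per column, worst offenders first
missing_pct = toy.isnull().mean().mul(100).round(2).sort_values(ascending=False)
print(missing_pct)
```

With the real dataset, replacing `toy` with `df` would show, for example, that the `Age` column has roughly 4-5% of its values missing.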
Age
df["Age"].value_counts(dropna=False)
35.0 231 36.0 223 NaN 216 34.0 203 30.0 193 32.0 190 31.0 189 37.0 182 33.0 182 29.0 177 38.0 172 41.0 150 39.0 148 40.0 143 28.0 143 42.0 137 27.0 135 43.0 125 46.0 117 45.0 110 26.0 104 44.0 99 51.0 88 47.0 87 50.0 84 25.0 73 52.0 68 49.0 65 48.0 64 53.0 64 55.0 63 54.0 59 24.0 56 56.0 55 23.0 46 22.0 46 59.0 42 21.0 41 20.0 38 19.0 32 58.0 30 57.0 28 60.0 27 18.0 14 61.0 8 Name: Age, dtype: int64
# Filling the missing values with the median value
df["Age"].fillna(value=df["Age"].median(), inplace=True)
# Defining bins for splitting the age to groups and creating a new column
bins = [10, 20, 30, 40, 50, 60, 70]
labels = [
"Less_than_20",
"Less_than_30",
"Less_than_40",
"Less_than_50",
"Less_than_60",
"Less_than_70",
]
df["AgeGroup"] = pd.cut(df["Age"], bins=bins, labels=labels, right=False)
df["AgeGroup"] = df["AgeGroup"].astype("category")
df["AgeGroup"].value_counts(dropna=False)
Less_than_40 2129 Less_than_50 1097 Less_than_30 859 Less_than_60 581 Less_than_20 46 Less_than_70 35 Name: AgeGroup, dtype: int64
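Note that `right=False` in `pd.cut` makes each bin closed on the left and open on the right, so an age of exactly 30 lands in `Less_than_40`, not `Less_than_30`. A minimal illustration with hypothetical ages (not the project data):

```python
import pandas as pd

bins = [10, 20, 30, 40]
labels = ["Less_than_20", "Less_than_30", "Less_than_40"]

# right=False => intervals are [10, 20), [20, 30), [30, 40)
ages = pd.Series([19, 20, 30, 39])
groups = pd.cut(ages, bins=bins, labels=labels, right=False)
print(groups.tolist())
```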
Duration of Pitch
df["DurationOfPitch"].value_counts(dropna=False)
9.0 466 7.0 334 8.0 324 6.0 299 16.0 270 15.0 262 NaN 246 14.0 245 10.0 234 13.0 213 11.0 196 12.0 187 17.0 169 30.0 90 22.0 88 31.0 80 23.0 77 18.0 73 27.0 72 25.0 72 32.0 72 26.0 71 29.0 71 21.0 70 24.0 69 35.0 65 28.0 61 20.0 61 33.0 56 19.0 55 34.0 50 36.0 41 5.0 6 126.0 1 127.0 1 Name: DurationOfPitch, dtype: int64
# Filling the missing values with the median value
df["DurationOfPitch"].fillna(value=df["DurationOfPitch"].median(), inplace=True)
Monthly Income
df["MonthlyIncome"].value_counts(dropna=False)
NaN 224
21020.0 7
17342.0 7
21288.0 7
20855.0 7
...
23463.0 1
28757.0 1
17742.0 1
20486.0 1
21471.0 1
Name: MonthlyIncome, Length: 2476, dtype: int64
# Filling the missing values with the median value
df["MonthlyIncome"].fillna(value=df["MonthlyIncome"].median(), inplace=True)
Type of Contact
df["TypeofContact"].value_counts(dropna=False)
Self Enquiry 3350 Company Invited 1372 NaN 25 Name: TypeofContact, dtype: int64
# Filling the missing values with 'Self Enquiry' since it is the most frequent value in the dataset
df["TypeofContact"].fillna("Self Enquiry", inplace=True)
df["TypeofContact"] = df["TypeofContact"].astype("category")
df["TypeofContact"].value_counts(dropna=False)
Self Enquiry 3375 Company Invited 1372 Name: TypeofContact, dtype: int64
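Hard-coding "Self Enquiry" works here, but the same fill value can be derived from the data itself via `mode()`, which avoids editing the code if the most frequent category ever changes. A sketch on a toy series (not the project data):

```python
import numpy as np
import pandas as pd

contact = pd.Series(["Self Enquiry", "Company Invited", "Self Enquiry", np.nan])

# .mode() returns the most frequent value(s); take the first for imputation
fill_value = contact.mode()[0]
filled = contact.fillna(fill_value)
print(fill_value, filled.isnull().sum())
```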
Number of Followups
df["NumberOfFollowups"].value_counts(dropna=False)
4.0 1999 3.0 1421 5.0 745 2.0 228 1.0 175 6.0 135 NaN 44 Name: NumberOfFollowups, dtype: int64
# Filling the missing values with the median value
df["NumberOfFollowups"].fillna(value=df["NumberOfFollowups"].median(), inplace=True)
# df["NumberOfFollowups"] = df["NumberOfFollowups"].astype("category")
df["NumberOfFollowups"].value_counts(dropna=False)
4.0 2043 3.0 1421 5.0 745 2.0 228 1.0 175 6.0 135 Name: NumberOfFollowups, dtype: int64
Preferred Property Star
df["PreferredPropertyStar"].value_counts(dropna=False)
3.0 2905 5.0 938 4.0 878 NaN 26 Name: PreferredPropertyStar, dtype: int64
# Filling the missing values with the median value
df["PreferredPropertyStar"].fillna(
value=df["PreferredPropertyStar"].median(), inplace=True
)
# df["PreferredPropertyStar"] = df["PreferredPropertyStar"].astype("category")
df["PreferredPropertyStar"].value_counts(dropna=False)
3.0 2931 5.0 938 4.0 878 Name: PreferredPropertyStar, dtype: int64
Number of Trips
df["NumberOfTrips"].value_counts(dropna=False)
2.0 1422 3.0 1051 1.0 601 4.0 468 5.0 443 6.0 307 7.0 211 NaN 138 8.0 102 19.0 1 21.0 1 20.0 1 22.0 1 Name: NumberOfTrips, dtype: int64
# Filling the missing values with the median value
df["NumberOfTrips"].fillna(value=df["NumberOfTrips"].median(), inplace=True)
df["NumberOfTrips"].value_counts(dropna=False)
2.0 1422 3.0 1189 1.0 601 4.0 468 5.0 443 6.0 307 7.0 211 8.0 102 19.0 1 21.0 1 20.0 1 22.0 1 Name: NumberOfTrips, dtype: int64
Number of Children Visiting
df["NumberOfChildrenVisiting"].value_counts(dropna=False)
1.0 2014 2.0 1304 0.0 1045 3.0 324 NaN 60 Name: NumberOfChildrenVisiting, dtype: int64
# Filling the missing values with the median value
df["NumberOfChildrenVisiting"].fillna(
value=df["NumberOfChildrenVisiting"].median(), inplace=True
)
df["NumberOfChildrenVisiting"].value_counts(dropna=False)
1.0 2074 2.0 1304 0.0 1045 3.0 324 Name: NumberOfChildrenVisiting, dtype: int64
Gender
df["Gender"].value_counts(dropna=False)
Male 2835 Female 1769 Fe Male 143 Name: Gender, dtype: int64
# Correcting the mislabeled 'Fe Male' entries to 'Female'
df["Gender"] = df["Gender"].str.replace("Fe Male", "Female")
df["Gender"] = df["Gender"].astype("category")
df["Gender"].value_counts(dropna=False)
Male 2835 Female 1912 Name: Gender, dtype: int64
Converting columns that have categorical values to the category type
# Converting columns that have categorical variables to the category type
df["Occupation"] = df["Occupation"].astype("category")
df["Gender"] = df["Gender"].astype("category")
df["ProductPitched"] = df["ProductPitched"].astype("category")
df["MaritalStatus"] = df["MaritalStatus"].astype("category")
df["Designation"] = df["Designation"].astype("category")
df.isnull().sum()
ProdTaken 0 Age 0 TypeofContact 0 CityTier 0 DurationOfPitch 0 Occupation 0 Gender 0 NumberOfPersonVisiting 0 NumberOfFollowups 0 ProductPitched 0 PreferredPropertyStar 0 MaritalStatus 0 NumberOfTrips 0 Passport 0 PitchSatisfactionScore 0 OwnCar 0 NumberOfChildrenVisiting 0 Designation 0 MonthlyIncome 0 AgeGroup 0 dtype: int64
# Checking for duplicated rows in the dataset
duplicateSum = df.duplicated().sum()
if duplicateSum > 0:
    print(f"- There are {duplicateSum} duplicated row(s) in the dataset")
    # Removing the duplicated rows from the dataset
    df.drop_duplicates(inplace=True)
    print(
        f"- There are {df.duplicated().sum()} duplicated row(s) in the dataset post cleaning"
    )
    # Resetting the index of the data frame since some rows were removed
    df.reset_index(drop=True, inplace=True)
else:
    print("- There are no duplicated row(s) in the dataset")
- There are no duplicated row(s) in the dataset
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 4747 entries, 0 to 4746 Data columns (total 20 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 ProdTaken 4747 non-null int64 1 Age 4747 non-null float64 2 TypeofContact 4747 non-null category 3 CityTier 4747 non-null int64 4 DurationOfPitch 4747 non-null float64 5 Occupation 4747 non-null category 6 Gender 4747 non-null category 7 NumberOfPersonVisiting 4747 non-null int64 8 NumberOfFollowups 4747 non-null float64 9 ProductPitched 4747 non-null category 10 PreferredPropertyStar 4747 non-null float64 11 MaritalStatus 4747 non-null category 12 NumberOfTrips 4747 non-null float64 13 Passport 4747 non-null int64 14 PitchSatisfactionScore 4747 non-null int64 15 OwnCar 4747 non-null int64 16 NumberOfChildrenVisiting 4747 non-null float64 17 Designation 4747 non-null category 18 MonthlyIncome 4747 non-null float64 19 AgeGroup 4747 non-null category dtypes: category(7), float64(7), int64(6) memory usage: 516.0 KB
# Printing the dimensions of the dataset
print(
f"- There are {df.shape[0]} row samples and {df.shape[1]} attributes of the customer information collected in this dataset."
)
- There are 4747 row samples and 20 attributes of the customer information collected in this dataset.
category_columnNames = df.describe(include=["category"]).columns
category_columnNames
Index(['TypeofContact', 'Occupation', 'Gender', 'ProductPitched',
'MaritalStatus', 'Designation', 'AgeGroup'],
dtype='object')
number_columnNames = (
df.describe(include=["int64"]).columns.tolist()
+ df.describe(include=["float64"]).columns.tolist()
)
number_columnNames
['ProdTaken', 'CityTier', 'NumberOfPersonVisiting', 'Passport', 'PitchSatisfactionScore', 'OwnCar', 'Age', 'DurationOfPitch', 'NumberOfFollowups', 'PreferredPropertyStar', 'NumberOfTrips', 'NumberOfChildrenVisiting', 'MonthlyIncome']
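As a side note, `DataFrame.select_dtypes` produces the same column lists more directly than chaining `describe(include=...)` calls. A sketch on a toy frame (not the project data):

```python
import pandas as pd

toy = pd.DataFrame(
    {"a": [1, 2], "b": [1.5, 2.5], "c": pd.Categorical(["x", "y"])}
)

# "number" covers both int64 and float64 columns in one call
num_cols = toy.select_dtypes(include="number").columns.tolist()
cat_cols = toy.select_dtypes(include="category").columns.tolist()
print(num_cols, cat_cols)
```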
catnumber_cols = df[
[
"CityTier",
"NumberOfPersonVisiting",
"Passport",
"PitchSatisfactionScore",
"OwnCar",
"NumberOfFollowups",
"PreferredPropertyStar",
"NumberOfChildrenVisiting",
]
].columns.tolist()
df.describe(include="all").T
| count | unique | top | freq | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| ProdTaken | 4747.0 | NaN | NaN | NaN | 0.188329 | 0.391016 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| Age | 4747.0 | NaN | NaN | NaN | 37.513377 | 9.119956 | 18.0 | 31.0 | 36.0 | 43.0 | 61.0 |
| TypeofContact | 4747 | 2 | Self Enquiry | 3375 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| CityTier | 4747.0 | NaN | NaN | NaN | 1.655151 | 0.917416 | 1.0 | 1.0 | 1.0 | 3.0 | 3.0 |
| DurationOfPitch | 4747.0 | NaN | NaN | NaN | 15.380872 | 8.330097 | 5.0 | 9.0 | 13.0 | 19.0 | 127.0 |
| Occupation | 4747 | 4 | Salaried | 2293 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Gender | 4747 | 2 | Male | 2835 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| NumberOfPersonVisiting | 4747.0 | NaN | NaN | NaN | 2.911734 | 0.72404 | 1.0 | 2.0 | 3.0 | 3.0 | 5.0 |
| NumberOfFollowups | 4747.0 | NaN | NaN | NaN | 3.707815 | 1.004388 | 1.0 | 3.0 | 4.0 | 4.0 | 6.0 |
| ProductPitched | 4747 | 5 | Basic | 1800 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| PreferredPropertyStar | 4747.0 | NaN | NaN | NaN | 3.580156 | 0.799316 | 3.0 | 3.0 | 3.0 | 4.0 | 5.0 |
| MaritalStatus | 4747 | 4 | Married | 2279 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| NumberOfTrips | 4747.0 | NaN | NaN | NaN | 3.226459 | 1.82121 | 1.0 | 2.0 | 3.0 | 4.0 | 22.0 |
| Passport | 4747.0 | NaN | NaN | NaN | 0.289657 | 0.453651 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
| PitchSatisfactionScore | 4747.0 | NaN | NaN | NaN | 3.051612 | 1.369584 | 1.0 | 2.0 | 3.0 | 4.0 | 5.0 |
| OwnCar | 4747.0 | NaN | NaN | NaN | 0.617653 | 0.486012 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 |
| NumberOfChildrenVisiting | 4747.0 | NaN | NaN | NaN | 1.191068 | 0.855278 | 0.0 | 1.0 | 1.0 | 2.0 | 3.0 |
| Designation | 4747 | 5 | Executive | 1800 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| MonthlyIncome | 4747.0 | NaN | NaN | NaN | 23541.308827 | 5264.00234 | 1000.0 | 20474.5 | 22311.0 | 25389.0 | 98678.0 |
| AgeGroup | 4747 | 6 | Less_than_40 | 2129 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
Data Structure:
Data Cleaning:
Data Description:
# Function to create labeled barplots
def labeled_barplot(data, feature, perc=False, n=None, hueCol=None):
    """
    Barplot with percentage at the top

    data: dataframe
    feature: dataframe column
    perc: whether to display percentages instead of count (default is False)
    n: displays the top n category levels (default is None, i.e., display all levels)
    hueCol: optional column used to split the bars by hue (default is None)
    """
    total = len(data[feature])  # length of the column
    count = data[feature].nunique()
    if n is None:
        plt.figure(figsize=(count + 1, 5))
    else:
        plt.figure(figsize=(n + 1, 7))

    plt.xticks(rotation=90, fontsize=15)
    ax = sns.countplot(
        data=data,
        x=feature,
        palette="Paired",
        hue=hueCol,
        order=data[feature].value_counts().index[:n].sort_values(),
    )

    for p in ax.patches:
        if perc:
            # percentage of each class of the category
            label = "{:.1f}%".format(100 * p.get_height() / total)
        else:
            label = p.get_height()  # count of each level of the category

        x = p.get_x() + p.get_width() / 2  # x position of the annotation
        y = p.get_height()  # y position of the annotation
        ax.annotate(
            label,
            (x, y),
            ha="center",
            va="center",
            size=12,
            xytext=(0, 5),
            textcoords="offset points",
        )  # annotate the count/percentage

    plt.show()  # show the plot
def stacked_barplot(data, predictor, target):
    """
    Print the category counts and plot a stacked bar chart

    data: dataframe
    predictor: independent variable
    target: target variable
    """
    count = data[predictor].nunique()
    sorter = data[target].value_counts().index[-1]
    tab1 = pd.crosstab(data[predictor], data[target], margins=True).sort_values(
        by=sorter, ascending=False
    )
    print("-" * 30, " Volume ", "-" * 30)
    print(tab1)
    tab1 = pd.crosstab(
        data[predictor], data[target], margins=True, normalize="index"
    ).sort_values(by=sorter, ascending=False)
    print("-" * 30, " Percentage % ", "-" * 30)
    print(tab1)
    print("-" * 120)
    tab = pd.crosstab(data[predictor], data[target], normalize="index").sort_values(
        by=sorter, ascending=False
    )
    tab.plot(kind="bar", stacked=True, figsize=(count + 5, 5))
    # A second legend call would override the first, so only one is needed
    plt.legend(loc="upper left", bbox_to_anchor=(1, 1))
    plt.show()
# Common function to draw a boxplot and a histogram for each analysis
def histogram_boxplot(data, feature, figsize=(15, 7), kde=True, bins=None):
    """
    Boxplot and histogram combined

    data: dataframe
    feature: dataframe column
    figsize: size of the figure (default is (15, 7))
    kde: whether to show the density curve (default is True)
    bins: number of bins for the histogram (default is None)
    """
    f2, (ax_box2, ax_hist2) = plt.subplots(
        nrows=2,  # number of rows of the subplot grid = 2
        sharex=True,  # x-axis will be shared among all subplots
        gridspec_kw={"height_ratios": (0.25, 0.75)},
        figsize=figsize,
    )  # creating the 2 subplots
    sns.boxplot(
        data=data, x=feature, ax=ax_box2, showmeans=True, color="violet"
    )  # boxplot with a star indicating the mean value of the column
    if bins:
        sns.histplot(
            data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins, palette="winter"
        )
    else:
        sns.histplot(data=data, x=feature, kde=kde, ax=ax_hist2)  # for histogram
    ax_hist2.axvline(
        data[feature].mean(), color="green", linestyle="--"
    )  # add mean to the histogram
    ax_hist2.axvline(
        data[feature].median(), color="black", linestyle="-"
    )  # add median to the histogram
# Functions to treat outliers by flooring and capping
def treat_outliers(df, col, lower=0.25, upper=0.75, mul=1.5):
    """
    Treats outliers in a variable

    df: dataframe
    col: dataframe column
    """
    Q1 = df[col].quantile(lower)  # 25th quantile
    Q3 = df[col].quantile(upper)  # 75th quantile
    IQR = Q3 - Q1
    Lower_Whisker = Q1 - (mul * IQR)
    Upper_Whisker = Q3 + (mul * IQR)
    # Values smaller than Lower_Whisker are floored to Lower_Whisker;
    # values greater than Upper_Whisker are capped at Upper_Whisker
    df[col] = np.clip(df[col], Lower_Whisker, Upper_Whisker)
    return df


def treat_outliers_all(df, col_list, lower=0.25, upper=0.75, mul=1.5):
    """
    Treats outliers in a list of variables

    df: dataframe
    col_list: list of dataframe columns
    """
    for c in col_list:
        df = treat_outliers(df, c, lower, upper, mul)
    return df
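As a quick sanity check of the flooring-and-capping logic, here is a self-contained sketch on toy data (it redefines a minimal version of `treat_outliers` so the snippet runs on its own; the column name `x` is illustrative):

```python
import numpy as np
import pandas as pd

def treat_outliers(df, col, lower=0.25, upper=0.75, mul=1.5):
    # Cap values outside the IQR-based whiskers, mirroring the function above
    Q1, Q3 = df[col].quantile(lower), df[col].quantile(upper)
    IQR = Q3 - Q1
    df[col] = np.clip(df[col], Q1 - mul * IQR, Q3 + mul * IQR)
    return df

# 100 is far above the upper whisker (Q3 + 1.5*IQR = 4 + 1.5*2 = 7), so it is capped at 7
toy = pd.DataFrame({"x": [1, 2, 3, 4, 100]})
treated = treat_outliers(toy.copy(), "x")
print(treated["x"].tolist())
```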
# Summary of data
df.describe(include="all").T
| count | unique | top | freq | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| ProdTaken | 4747.0 | NaN | NaN | NaN | 0.188329 | 0.391016 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| Age | 4747.0 | NaN | NaN | NaN | 37.513377 | 9.119956 | 18.0 | 31.0 | 36.0 | 43.0 | 61.0 |
| TypeofContact | 4747 | 2 | Self Enquiry | 3375 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| CityTier | 4747.0 | NaN | NaN | NaN | 1.655151 | 0.917416 | 1.0 | 1.0 | 1.0 | 3.0 | 3.0 |
| DurationOfPitch | 4747.0 | NaN | NaN | NaN | 15.380872 | 8.330097 | 5.0 | 9.0 | 13.0 | 19.0 | 127.0 |
| Occupation | 4747 | 4 | Salaried | 2293 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Gender | 4747 | 2 | Male | 2835 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| NumberOfPersonVisiting | 4747.0 | NaN | NaN | NaN | 2.911734 | 0.72404 | 1.0 | 2.0 | 3.0 | 3.0 | 5.0 |
| NumberOfFollowups | 4747.0 | NaN | NaN | NaN | 3.707815 | 1.004388 | 1.0 | 3.0 | 4.0 | 4.0 | 6.0 |
| ProductPitched | 4747 | 5 | Basic | 1800 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| PreferredPropertyStar | 4747.0 | NaN | NaN | NaN | 3.580156 | 0.799316 | 3.0 | 3.0 | 3.0 | 4.0 | 5.0 |
| MaritalStatus | 4747 | 4 | Married | 2279 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| NumberOfTrips | 4747.0 | NaN | NaN | NaN | 3.226459 | 1.82121 | 1.0 | 2.0 | 3.0 | 4.0 | 22.0 |
| Passport | 4747.0 | NaN | NaN | NaN | 0.289657 | 0.453651 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
| PitchSatisfactionScore | 4747.0 | NaN | NaN | NaN | 3.051612 | 1.369584 | 1.0 | 2.0 | 3.0 | 4.0 | 5.0 |
| OwnCar | 4747.0 | NaN | NaN | NaN | 0.617653 | 0.486012 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 |
| NumberOfChildrenVisiting | 4747.0 | NaN | NaN | NaN | 1.191068 | 0.855278 | 0.0 | 1.0 | 1.0 | 2.0 | 3.0 |
| Designation | 4747 | 5 | Executive | 1800 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| MonthlyIncome | 4747.0 | NaN | NaN | NaN | 23541.308827 | 5264.00234 | 1000.0 | 20474.5 | 22311.0 | 25389.0 | 98678.0 |
| AgeGroup | 4747 | 6 | Less_than_40 | 2129 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
# printing the number of occurrences of each unique value in each categorical column
num_to_display = 15
for column in category_columnNames:
    val_counts = df[column].value_counts(
        dropna=False
    )  # kept dropna=False to see the NA value count as well
    val_countsP = df[column].value_counts(dropna=False, normalize=True)
    print("Unique values in", column, "are :")
    print(val_counts[:num_to_display])
    # print(val_countsP[:num_to_display])
    if len(val_counts) > num_to_display:
        print(f"Only displaying first {num_to_display} of {len(val_counts)} values.")
    labeled_barplot(df, column, perc=True, n=5)
    plt.tight_layout()
    print("-" * 50)
    print(" ")
Unique values in TypeofContact are :
Self Enquiry       3375
Company Invited    1372
Name: TypeofContact, dtype: int64
--------------------------------------------------
Unique values in Occupation are :
Salaried          2293
Small Business    2028
Large Business     424
Free Lancer          2
Name: Occupation, dtype: int64
--------------------------------------------------
Unique values in Gender are :
Male      2835
Female    1912
Name: Gender, dtype: int64
--------------------------------------------------
Unique values in ProductPitched are :
Basic           1800
Deluxe          1684
Standard         714
Super Deluxe     324
King             225
Name: ProductPitched, dtype: int64
--------------------------------------------------
Unique values in MaritalStatus are :
Married      2279
Divorced      950
Single        875
Unmarried     643
Name: MaritalStatus, dtype: int64
--------------------------------------------------
Unique values in Designation are :
Executive         1800
Manager           1684
Senior Manager     714
AVP                324
VP                 225
Name: Designation, dtype: int64
--------------------------------------------------
Unique values in AgeGroup are :
Less_than_40    2129
Less_than_50    1097
Less_than_30     859
Less_than_60     581
Less_than_20      46
Less_than_70      35
Name: AgeGroup, dtype: int64
--------------------------------------------------
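The printout above comes straight from `value_counts`; here is a minimal, self-contained sketch of the same pattern on a made-up column (the `contact` Series is illustrative data, not the notebook's):

```python
import pandas as pd

# Toy column standing in for a categorical feature such as TypeofContact
contact = pd.Series(
    ["Self Enquiry", "Self Enquiry", "Company Invited", None, "Self Enquiry"]
)

# dropna=False keeps NaN as its own category, so missing values stay visible
counts = contact.value_counts(dropna=False)
shares = contact.value_counts(dropna=False, normalize=True)

print(counts)
print(shares)
```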
Observations:
# creating histograms
df.hist(figsize=(14, 14))
plt.show()
# Summary of numeric data
df.describe().T
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| ProdTaken | 4747.0 | 0.188329 | 0.391016 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| Age | 4747.0 | 37.513377 | 9.119956 | 18.0 | 31.0 | 36.0 | 43.0 | 61.0 |
| CityTier | 4747.0 | 1.655151 | 0.917416 | 1.0 | 1.0 | 1.0 | 3.0 | 3.0 |
| DurationOfPitch | 4747.0 | 15.380872 | 8.330097 | 5.0 | 9.0 | 13.0 | 19.0 | 127.0 |
| NumberOfPersonVisiting | 4747.0 | 2.911734 | 0.724040 | 1.0 | 2.0 | 3.0 | 3.0 | 5.0 |
| NumberOfFollowups | 4747.0 | 3.707815 | 1.004388 | 1.0 | 3.0 | 4.0 | 4.0 | 6.0 |
| PreferredPropertyStar | 4747.0 | 3.580156 | 0.799316 | 3.0 | 3.0 | 3.0 | 4.0 | 5.0 |
| NumberOfTrips | 4747.0 | 3.226459 | 1.821210 | 1.0 | 2.0 | 3.0 | 4.0 | 22.0 |
| Passport | 4747.0 | 0.289657 | 0.453651 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
| PitchSatisfactionScore | 4747.0 | 3.051612 | 1.369584 | 1.0 | 2.0 | 3.0 | 4.0 | 5.0 |
| OwnCar | 4747.0 | 0.617653 | 0.486012 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 |
| NumberOfChildrenVisiting | 4747.0 | 1.191068 | 0.855278 | 0.0 | 1.0 | 1.0 | 2.0 | 3.0 |
| MonthlyIncome | 4747.0 | 23541.308827 | 5264.002340 | 1000.0 | 20474.5 | 22311.0 | 25389.0 | 98678.0 |
Observations:
histogram_boxplot(df, "Age")
Observations:
histogram_boxplot(df, "DurationOfPitch")
Observations:
df[df.DurationOfPitch > 36]["DurationOfPitch"].describe()
count      2.000000
mean     126.500000
std        0.707107
min      126.000000
25%      126.250000
50%      126.500000
75%      126.750000
max      127.000000
Name: DurationOfPitch, dtype: float64
# Dropping the extreme DurationOfPitch outliers (values above 36)
df.drop(df[df.DurationOfPitch > 36].index, inplace=True)
df.reset_index(drop=True, inplace=True)
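The cutoff of 36 above was chosen by inspecting the box plot; an explicit quantile-based fence gives a comparable result. A sketch on made-up pitch durations (the strict `4 * iqr` multiplier is illustrative; the textbook whisker rule uses 1.5):

```python
import pandas as pd

# Made-up pitch durations with two extreme values, mimicking the 126/127-minute outliers
pitch = pd.Series(
    [5, 6, 8, 9, 10, 12, 13, 15, 17, 19, 21, 24, 30, 126, 127], dtype=float
)

q1, q3 = pitch.quantile(0.25), pitch.quantile(0.75)
iqr = q3 - q1
upper_fence = q3 + 4 * iqr  # stricter than the usual Q3 + 1.5*IQR whisker

kept = pitch[pitch <= upper_fence]
print(upper_fence, kept.max())
```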
histogram_boxplot(df, "DurationOfPitch")
histogram_boxplot(df, "NumberOfPersonVisiting")
df[df.NumberOfPersonVisiting > 4]["NumberOfPersonVisiting"].describe()
count    3.0
mean     5.0
std      0.0
min      5.0
25%      5.0
50%      5.0
75%      5.0
max      5.0
Name: NumberOfPersonVisiting, dtype: float64
Observations:
histogram_boxplot(df, "NumberOfTrips")
df[df.NumberOfTrips > 9]["NumberOfTrips"].describe()
count     4.000000
mean     20.500000
std       1.290994
min      19.000000
25%      19.750000
50%      20.500000
75%      21.250000
max      22.000000
Name: NumberOfTrips, dtype: float64
# Dropping the extreme NumberOfTrips outliers (values above 8)
df.drop(df[df.NumberOfTrips > 8].index, inplace=True)
df.reset_index(drop=True, inplace=True)
histogram_boxplot(df, "NumberOfTrips")
Observations:
histogram_boxplot(df, "NumberOfChildrenVisiting")
Observations:
histogram_boxplot(df, "MonthlyIncome")
Observations:
df = treat_outliers(df, "MonthlyIncome", 0.2, 0.8, 1.5)
histogram_boxplot(df, "MonthlyIncome")
# Plotting a heatmap of the pairwise correlation matrix
correlation = df.corr()
plt.figure(figsize=(15, 7))
sns.heatmap(correlation, vmin=-1, vmax=1, annot=True, cmap="Spectral")
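`df.corr()` computes pairwise Pearson correlations over the numeric columns (depending on the pandas version, `numeric_only=True` may be needed when categorical columns are present). A quick toy check of what the heatmap encodes:

```python
import pandas as pd

# Toy frame: b is perfectly positively, c perfectly negatively, related to a
toy = pd.DataFrame({"a": [1, 2, 3, 4], "b": [2, 4, 6, 8], "c": [4, 3, 2, 1]})
corr = toy.corr()
print(corr)
```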
sns.pairplot(df, corner=True, hue="ProdTaken")
# plotting a labeled bar plot of each categorical column, split by whether the product was taken
for cols in category_columnNames:
    labeled_barplot(df, cols, perc=True, n=10, hueCol="ProdTaken")
    plt.tight_layout()
# numeric columns to compare against the target with box plots
cols = [
    "Age",
    "DurationOfPitch",
    "NumberOfPersonVisiting",
    "NumberOfTrips",
    "NumberOfChildrenVisiting",
    "MonthlyIncome",
    "NumberOfFollowups",
    "PitchSatisfactionScore",
]
plt.figure(figsize=(15, 25))
for i, variable in enumerate(cols):
    plt.subplot(5, 2, i + 1)
    sns.boxplot(x=df["ProdTaken"], y=df[variable], palette="PuBu", showfliers=False)
    plt.tight_layout()
    plt.title(variable)
plt.show()
Observation:
Age Group vs Product Taken
stacked_barplot(df, "AgeGroup", "ProdTaken")
------------------------------ Volume ------------------------------
ProdTaken        0    1   All
AgeGroup
All           3849  892  4741
Less_than_40  1773  352  2125
Less_than_30   597  262   859
Less_than_50   948  148  1096
Less_than_60   480  100   580
Less_than_20    17   29    46
Less_than_70    34    1    35
------------------------------ Percentage % ------------------------------
ProdTaken            0         1
AgeGroup
Less_than_20  0.369565  0.630435
Less_than_30  0.694994  0.305006
All           0.811854  0.188146
Less_than_60  0.827586  0.172414
Less_than_40  0.834353  0.165647
Less_than_50  0.864964  0.135036
Less_than_70  0.971429  0.028571
------------------------------------------------------------------------------------------------------------------------
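The Volume and Percentage tables that `stacked_barplot` prints can be reproduced with `pd.crosstab`; a sketch on made-up data (the toy frame is illustrative, not the notebook's):

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "AgeGroup": ["Less_than_30"] * 4 + ["Less_than_40"] * 6,
        "ProdTaken": [1, 1, 0, 0, 1, 0, 0, 0, 0, 0],
    }
)

# Volume: raw counts with row/column totals
volume = pd.crosstab(toy["AgeGroup"], toy["ProdTaken"], margins=True)
# Percentage: each row normalized to sum to 1
share = pd.crosstab(toy["AgeGroup"], toy["ProdTaken"], normalize="index")

print(volume)
print(share)
```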
Observations:
TypeofContact vs Product Taken
stacked_barplot(df, "TypeofContact", "ProdTaken")
------------------------------ Volume ------------------------------
ProdTaken           0    1   All
TypeofContact
All              3849  892  4741
Self Enquiry     2775  600  3375
Company Invited  1074  292  1366
------------------------------ Percentage % ------------------------------
ProdTaken               0         1
TypeofContact
Company Invited  0.786237  0.213763
All              0.811854  0.188146
Self Enquiry     0.822222  0.177778
------------------------------------------------------------------------------------------------------------------------
Observations:
City Tier vs Product Taken
stacked_barplot(df, "CityTier", "ProdTaken")
------------------------------ Volume ------------------------------
ProdTaken     0    1   All
CityTier
All        3849  892  4741
1          2590  504  3094
3          1113  346  1459
2           146   42   188
------------------------------ Percentage % ------------------------------
ProdTaken         0         1
CityTier
3          0.762851  0.237149
2          0.776596  0.223404
All        0.811854  0.188146
1          0.837104  0.162896
------------------------------------------------------------------------------------------------------------------------
Observations:
Occupation vs Product Taken
stacked_barplot(df, "Occupation", "ProdTaken")
------------------------------ Volume ------------------------------
ProdTaken          0    1   All
Occupation
All             3849  892  4741
Salaried        1889  400  2289
Small Business  1654  374  2028
Large Business   306  116   422
Free Lancer        0    2     2
------------------------------ Percentage % ------------------------------
ProdTaken              0         1
Occupation
Free Lancer     0.000000  1.000000
Large Business  0.725118  0.274882
All             0.811854  0.188146
Small Business  0.815582  0.184418
Salaried        0.825251  0.174749
------------------------------------------------------------------------------------------------------------------------
Observations:
Gender vs Prod Taken
stacked_barplot(df, "Gender", "ProdTaken")
------------------------------ Volume ------------------------------
ProdTaken     0    1   All
Gender
All        3849  892  4741
Male       2269  560  2829
Female     1580  332  1912
------------------------------ Percentage % ------------------------------
ProdTaken         0         1
Gender
Male       0.802050  0.197950
All        0.811854  0.188146
Female     0.826360  0.173640
------------------------------------------------------------------------------------------------------------------------
Observations:
Marital Status vs Prod Taken
stacked_barplot(df, "MaritalStatus", "ProdTaken")
------------------------------ Volume ------------------------------
ProdTaken         0    1   All
MaritalStatus
All            3849  892  4741
Married        1963  314  2277
Single          578  295   873
Unmarried       482  159   641
Divorced        826  124   950
------------------------------ Percentage % ------------------------------
ProdTaken             0         1
MaritalStatus
Single         0.662085  0.337915
Unmarried      0.751950  0.248050
All            0.811854  0.188146
Married        0.862099  0.137901
Divorced       0.869474  0.130526
------------------------------------------------------------------------------------------------------------------------
Observations:
CityTier vs Type of Contact vs Prod Taken
g = sns.FacetGrid(df, col="CityTier", hue="ProdTaken", col_wrap=4, margin_titles=True)
g.map(sns.histplot, "TypeofContact")
g.add_legend()
Observations:
CityTier vs Gender vs Prod Taken
g = sns.FacetGrid(df, col="CityTier", hue="ProdTaken", col_wrap=4, margin_titles=True)
g.map(sns.histplot, "Gender")
g.add_legend()
Observations:
Number of Followups vs Income vs Prod Taken
g = sns.FacetGrid(
df, col="NumberOfFollowups", hue="ProdTaken", col_wrap=4, margin_titles=True
)
g.map(sns.scatterplot, "NumberOfFollowups", "MonthlyIncome")
g.add_legend()
Observations:
Number of Followups vs Duration of Pitch vs Product Pitched vs Prod Taken
g = sns.FacetGrid(
df, col="NumberOfFollowups", hue="ProdTaken", col_wrap=4, margin_titles=True
)
g.map(sns.scatterplot, "DurationOfPitch", "ProductPitched")
g.add_legend()
Observations:
Monthly Income vs Age vs Designation vs Product Taken
g = sns.FacetGrid(
df, col="Designation", hue="ProdTaken", col_wrap=4, margin_titles=True
)
g.map(sns.scatterplot, "MonthlyIncome", "Age")
g.add_legend()
Observations:
PitchSatisfactionScore vs Duration of Pitch vs Product Pitched
plt.figure(figsize=(15, 7))
sns.boxplot(
x="PitchSatisfactionScore", y="DurationOfPitch", data=df, hue="ProductPitched"
)
plt.show()
Observations:
df_ProdTaken = df[df["ProdTaken"].astype("int") == 1]
# creating histograms
df_ProdTaken.hist(figsize=(14, 14))
plt.show()
# plotting labeled bar plots of the discrete numeric columns for customers who purchased, split by product pitched
for cols in catnumber_cols:
    labeled_barplot(df_ProdTaken, cols, perc=True, n=10, hueCol="ProductPitched")
    plt.tight_layout()
# plotting labeled bar plots of the categorical columns for customers who purchased, split by product pitched
for cols in category_columnNames:
    labeled_barplot(df_ProdTaken, cols, perc=True, n=10, hueCol="ProductPitched")
    plt.tight_layout()
df_ProdBasic = df[(df["ProductPitched"] == "Basic") & (df["ProdTaken"] == 1)]
df_ProdBasic.describe(include="all").T
| count | unique | top | freq | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| ProdTaken | 538.0 | NaN | NaN | NaN | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| Age | 538.0 | NaN | NaN | NaN | 31.548327 | 8.906011 | 18.0 | 26.0 | 30.0 | 36.0 | 59.0 |
| TypeofContact | 538 | 2 | Self Enquiry | 352 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| CityTier | 538.0 | NaN | NaN | NaN | 1.520446 | 0.839299 | 1.0 | 1.0 | 1.0 | 2.0 | 3.0 |
| DurationOfPitch | 538.0 | NaN | NaN | NaN | 15.654275 | 7.776881 | 6.0 | 9.0 | 13.0 | 21.0 | 36.0 |
| Occupation | 538 | 4 | Salaried | 252 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Gender | 538 | 2 | Male | 334 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| NumberOfPersonVisiting | 538.0 | NaN | NaN | NaN | 2.912639 | 0.702343 | 2.0 | 2.0 | 3.0 | 3.0 | 4.0 |
| NumberOfFollowups | 538.0 | NaN | NaN | NaN | 3.951673 | 0.965652 | 1.0 | 3.0 | 4.0 | 5.0 | 6.0 |
| ProductPitched | 538 | 1 | Basic | 538 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| PreferredPropertyStar | 538.0 | NaN | NaN | NaN | 3.784387 | 0.866282 | 3.0 | 3.0 | 3.0 | 5.0 | 5.0 |
| MaritalStatus | 538 | 4 | Single | 225 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| NumberOfTrips | 538.0 | NaN | NaN | NaN | 3.184015 | 1.835296 | 1.0 | 2.0 | 3.0 | 3.0 | 8.0 |
| Passport | 538.0 | NaN | NaN | NaN | 0.579926 | 0.49403 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 |
| PitchSatisfactionScore | 538.0 | NaN | NaN | NaN | 3.185874 | 1.351882 | 1.0 | 2.0 | 3.0 | 4.0 | 5.0 |
| OwnCar | 538.0 | NaN | NaN | NaN | 0.576208 | 0.494618 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 |
| NumberOfChildrenVisiting | 538.0 | NaN | NaN | NaN | 1.224907 | 0.869284 | 0.0 | 1.0 | 1.0 | 2.0 | 3.0 |
| Designation | 538 | 1 | Executive | 538 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| MonthlyIncome | 538.0 | NaN | NaN | NaN | 20255.420074 | 3288.711356 | 16009.0 | 17564.0 | 20721.0 | 21529.0 | 37376.5 |
| AgeGroup | 538 | 5 | Less_than_30 | 220 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
# creating histograms
df_ProdBasic.hist(figsize=(14, 14))
plt.show()
# plotting labeled bar plots of the discrete numeric columns for Basic-package buyers
for cols in catnumber_cols:
    labeled_barplot(df_ProdBasic, cols, perc=True, n=10, hueCol="ProductPitched")
    plt.tight_layout()
# plotting labeled bar plots of the categorical columns for Basic-package buyers
for cols in category_columnNames:
    labeled_barplot(df_ProdBasic, cols, perc=True, n=10, hueCol="ProductPitched")
    plt.tight_layout()
df_ProdStd = df[(df["ProductPitched"] == "Standard") & (df["ProdTaken"] == 1)]
df_ProdStd.describe(include="all").T
| count | unique | top | freq | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| ProdTaken | 120.0 | NaN | NaN | NaN | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| Age | 120.0 | NaN | NaN | NaN | 41.166667 | 9.948044 | 19.0 | 33.0 | 38.0 | 49.0 | 60.0 |
| TypeofContact | 120 | 2 | Self Enquiry | 90 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| CityTier | 120.0 | NaN | NaN | NaN | 2.083333 | 0.975182 | 1.0 | 1.0 | 3.0 | 3.0 | 3.0 |
| DurationOfPitch | 120.0 | NaN | NaN | NaN | 18.983333 | 9.009783 | 6.0 | 11.0 | 17.0 | 27.5 | 36.0 |
| Occupation | 120 | 3 | Small Business | 58 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Gender | 120 | 2 | Male | 74 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| NumberOfPersonVisiting | 120.0 | NaN | NaN | NaN | 2.983333 | 0.709874 | 2.0 | 2.0 | 3.0 | 3.0 | 4.0 |
| NumberOfFollowups | 120.0 | NaN | NaN | NaN | 3.95 | 0.915322 | 1.0 | 3.0 | 4.0 | 5.0 | 6.0 |
| ProductPitched | 120 | 1 | Standard | 120 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| PreferredPropertyStar | 120.0 | NaN | NaN | NaN | 3.683333 | 0.859777 | 3.0 | 3.0 | 3.0 | 5.0 | 5.0 |
| MaritalStatus | 120 | 4 | Married | 54 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| NumberOfTrips | 120.0 | NaN | NaN | NaN | 3.066667 | 1.813735 | 1.0 | 2.0 | 3.0 | 4.0 | 8.0 |
| Passport | 120.0 | NaN | NaN | NaN | 0.366667 | 0.483915 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
| PitchSatisfactionScore | 120.0 | NaN | NaN | NaN | 3.466667 | 1.328001 | 1.0 | 3.0 | 3.0 | 5.0 | 5.0 |
| OwnCar | 120.0 | NaN | NaN | NaN | 0.65 | 0.478969 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 |
| NumberOfChildrenVisiting | 120.0 | NaN | NaN | NaN | 1.125 | 0.903425 | 0.0 | 0.0 | 1.0 | 2.0 | 3.0 |
| Designation | 120 | 1 | Senior Manager | 120 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| MonthlyIncome | 120.0 | NaN | NaN | NaN | 26007.945833 | 3607.826435 | 17372.0 | 23722.0 | 25711.0 | 28642.5 | 37376.5 |
| AgeGroup | 120 | 6 | Less_than_40 | 48 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
# creating histograms
df_ProdStd.hist(figsize=(14, 14))
plt.show()
# plotting labeled bar plots of the discrete numeric columns for Standard-package buyers
for cols in catnumber_cols:
    labeled_barplot(df_ProdStd, cols, perc=True, n=10, hueCol="ProductPitched")
    plt.tight_layout()
# plotting labeled bar plots of the categorical columns for Standard-package buyers
for cols in category_columnNames:
    labeled_barplot(df_ProdStd, cols, perc=True, n=10, hueCol="ProductPitched")
    plt.tight_layout()
df_ProdDel = df[(df["ProductPitched"] == "Deluxe") & (df["ProdTaken"] == 1)]
df_ProdDel.describe(include="all").T
| count | unique | top | freq | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| ProdTaken | 198.0 | NaN | NaN | NaN | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| Age | 198.0 | NaN | NaN | NaN | 37.636364 | 8.444449 | 21.0 | 32.0 | 36.0 | 44.0 | 59.0 |
| TypeofContact | 198 | 2 | Self Enquiry | 134 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| CityTier | 198.0 | NaN | NaN | NaN | 2.414141 | 0.91252 | 1.0 | 1.0 | 3.0 | 3.0 | 3.0 |
| DurationOfPitch | 198.0 | NaN | NaN | NaN | 18.358586 | 8.878635 | 6.0 | 12.0 | 15.0 | 26.0 | 36.0 |
| Occupation | 198 | 3 | Small Business | 102 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Gender | 198 | 2 | Male | 132 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| NumberOfPersonVisiting | 198.0 | NaN | NaN | NaN | 2.954545 | 0.707433 | 2.0 | 2.0 | 3.0 | 3.0 | 4.0 |
| NumberOfFollowups | 198.0 | NaN | NaN | NaN | 3.974747 | 1.049229 | 1.0 | 3.0 | 4.0 | 5.0 | 6.0 |
| ProductPitched | 198 | 1 | Deluxe | 198 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| PreferredPropertyStar | 198.0 | NaN | NaN | NaN | 3.686869 | 0.856702 | 3.0 | 3.0 | 3.0 | 5.0 | 5.0 |
| MaritalStatus | 198 | 4 | Married | 68 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| NumberOfTrips | 198.0 | NaN | NaN | NaN | 3.691919 | 2.015432 | 1.0 | 2.0 | 3.0 | 5.0 | 8.0 |
| Passport | 198.0 | NaN | NaN | NaN | 0.505051 | 0.501242 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 |
| PitchSatisfactionScore | 198.0 | NaN | NaN | NaN | 3.020202 | 1.282287 | 1.0 | 2.0 | 3.0 | 4.0 | 5.0 |
| OwnCar | 198.0 | NaN | NaN | NaN | 0.59596 | 0.491949 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 |
| NumberOfChildrenVisiting | 198.0 | NaN | NaN | NaN | 1.191919 | 0.839105 | 0.0 | 1.0 | 1.0 | 2.0 | 3.0 |
| Designation | 198 | 1 | Manager | 198 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| MonthlyIncome | 198.0 | NaN | NaN | NaN | 23059.919192 | 3508.493985 | 17086.0 | 20764.25 | 22904.5 | 24479.0 | 37376.5 |
| AgeGroup | 198 | 4 | Less_than_40 | 103 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
# creating histograms
df_ProdDel.hist(figsize=(14, 14))
plt.show()
# plotting labeled bar plots of the discrete numeric columns for Deluxe-package buyers
for cols in catnumber_cols:
    labeled_barplot(df_ProdDel, cols, perc=True, n=10, hueCol="ProductPitched")
    plt.tight_layout()
# plotting labeled bar plots of the categorical columns for Deluxe-package buyers
for cols in category_columnNames:
    labeled_barplot(df_ProdDel, cols, perc=True, n=10, hueCol="ProductPitched")
    plt.tight_layout()
df_ProdSupDeluxe = df[(df["ProductPitched"] == "Super Deluxe") & (df["ProdTaken"] == 1)]
df_ProdSupDeluxe.describe(include="all").T
| count | unique | top | freq | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| ProdTaken | 16.0 | NaN | NaN | NaN | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| Age | 16.0 | NaN | NaN | NaN | 44.125 | 5.188127 | 39.0 | 40.75 | 42.0 | 46.25 | 56.0 |
| TypeofContact | 16 | 2 | Company Invited | 12 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| CityTier | 16.0 | NaN | NaN | NaN | 2.75 | 0.68313 | 1.0 | 3.0 | 3.0 | 3.0 | 3.0 |
| DurationOfPitch | 16.0 | NaN | NaN | NaN | 19.75 | 7.28011 | 8.0 | 15.75 | 19.0 | 22.5 | 31.0 |
| Occupation | 16 | 2 | Salaried | 12 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Gender | 16 | 2 | Male | 12 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| NumberOfPersonVisiting | 16.0 | NaN | NaN | NaN | 2.75 | 0.68313 | 2.0 | 2.0 | 3.0 | 3.0 | 4.0 |
| NumberOfFollowups | 16.0 | NaN | NaN | NaN | 3.0 | 1.788854 | 1.0 | 1.75 | 2.5 | 4.25 | 6.0 |
| ProductPitched | 16 | 1 | Super Deluxe | 16 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| PreferredPropertyStar | 16.0 | NaN | NaN | NaN | 3.5 | 0.730297 | 3.0 | 3.0 | 3.0 | 4.0 | 5.0 |
| MaritalStatus | 16 | 3 | Single | 8 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| NumberOfTrips | 16.0 | NaN | NaN | NaN | 3.6875 | 2.5224 | 1.0 | 1.75 | 2.5 | 6.0 | 8.0 |
| Passport | 16.0 | NaN | NaN | NaN | 0.625 | 0.5 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 |
| PitchSatisfactionScore | 16.0 | NaN | NaN | NaN | 3.75 | 1.0 | 3.0 | 3.0 | 3.0 | 5.0 | 5.0 |
| OwnCar | 16.0 | NaN | NaN | NaN | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| NumberOfChildrenVisiting | 16.0 | NaN | NaN | NaN | 1.25 | 0.856349 | 0.0 | 1.0 | 1.0 | 2.0 | 3.0 |
| Designation | 16 | 1 | AVP | 16 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| MonthlyIncome | 16.0 | NaN | NaN | NaN | 29821.28125 | 3807.962185 | 21151.0 | 28129.5 | 29802.5 | 31997.25 | 37376.5 |
| AgeGroup | 16 | 3 | Less_than_50 | 13 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
# creating histograms
df_ProdSupDeluxe.hist(figsize=(14, 14))
plt.show()
# plotting labeled bar plots of the discrete numeric columns for Super Deluxe-package buyers
for cols in catnumber_cols:
    labeled_barplot(df_ProdSupDeluxe, cols, perc=True, n=10, hueCol="ProductPitched")
    plt.tight_layout()
# plotting labeled bar plots of the categorical columns for Super Deluxe-package buyers
for cols in category_columnNames:
    labeled_barplot(df_ProdSupDeluxe, cols, perc=True, n=10, hueCol="ProductPitched")
    plt.tight_layout()
df_ProdKing = df[(df["ProductPitched"] == "King") & (df["ProdTaken"] == 1)]
df_ProdKing.describe(include="all").T
| count | unique | top | freq | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| ProdTaken | 20.0 | NaN | NaN | NaN | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| Age | 20.0 | NaN | NaN | NaN | 48.9 | 9.618513 | 27.0 | 42.0 | 52.5 | 56.0 | 59.0 |
| TypeofContact | 20 | 1 | Self Enquiry | 20 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| CityTier | 20.0 | NaN | NaN | NaN | 1.8 | 1.005249 | 1.0 | 1.0 | 1.0 | 3.0 | 3.0 |
| DurationOfPitch | 20.0 | NaN | NaN | NaN | 10.5 | 4.135851 | 8.0 | 8.0 | 9.0 | 9.0 | 19.0 |
| Occupation | 20 | 3 | Small Business | 12 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Gender | 20 | 2 | Female | 12 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| NumberOfPersonVisiting | 20.0 | NaN | NaN | NaN | 2.9 | 0.718185 | 2.0 | 2.0 | 3.0 | 3.0 | 4.0 |
| NumberOfFollowups | 20.0 | NaN | NaN | NaN | 4.3 | 1.128576 | 3.0 | 3.0 | 4.0 | 5.0 | 6.0 |
| ProductPitched | 20 | 1 | King | 20 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| PreferredPropertyStar | 20.0 | NaN | NaN | NaN | 3.6 | 0.680557 | 3.0 | 3.0 | 3.5 | 4.0 | 5.0 |
| MaritalStatus | 20 | 3 | Single | 8 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| NumberOfTrips | 20.0 | NaN | NaN | NaN | 3.35 | 1.785173 | 1.0 | 2.0 | 3.0 | 3.25 | 7.0 |
| Passport | 20.0 | NaN | NaN | NaN | 0.6 | 0.502625 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 |
| PitchSatisfactionScore | 20.0 | NaN | NaN | NaN | 3.3 | 1.218282 | 1.0 | 3.0 | 3.0 | 4.0 | 5.0 |
| OwnCar | 20.0 | NaN | NaN | NaN | 0.9 | 0.307794 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| NumberOfChildrenVisiting | 20.0 | NaN | NaN | NaN | 1.35 | 0.812728 | 0.0 | 1.0 | 1.0 | 2.0 | 3.0 |
| Designation | 20 | 1 | VP | 20 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| MonthlyIncome | 20.0 | NaN | NaN | NaN | 34295.725 | 5331.953768 | 17517.0 | 34470.25 | 34859.0 | 37376.5 | 37376.5 |
| AgeGroup | 20 | 3 | Less_than_60 | 12 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
# creating histograms
df_ProdKing.hist(figsize=(14, 14))
plt.show()
# plotting labeled bar plots of the discrete numeric columns for King-package buyers
for cols in catnumber_cols:
    labeled_barplot(df_ProdKing, cols, perc=True, n=10, hueCol="ProductPitched")
    plt.tight_layout()
# plotting labeled bar plots of the categorical columns for King-package buyers
for cols in category_columnNames:
    labeled_barplot(df_ProdKing, cols, perc=True, n=10, hueCol="ProductPitched")
    plt.tight_layout()
Predicting a customer will purchase the package when in reality they would not - loss of resources.
Predicting a customer will not purchase the package when in reality they would have - loss of opportunity.
By maximizing the F1 score, we will be able to identify the right customers for the marketing team to focus their pitches on. We will also look to maximize Recall, so that fewer likely buyers are missed (loss of opportunity).
# defining a function to compute different metrics to check the performance of a classification model built using sklearn
def model_performance_classification_sklearn_with_threshold(model, predictors, target, threshold=0.5):
    """
    Function to compute different metrics, based on the threshold specified, to check classification model performance

    model: classifier
    predictors: independent variables
    target: dependent variable
    threshold: threshold for classifying the observation as class 1
    """
    # predicting the probability of class 1 using the independent variables
    pred_prob = model.predict_proba(predictors)[:, 1]
    pred = (pred_prob > threshold).astype(int)

    acc = accuracy_score(target, pred)  # to compute Accuracy
    recall = recall_score(target, pred)  # to compute Recall
    precision = precision_score(target, pred)  # to compute Precision
    f1 = f1_score(target, pred)  # to compute F1-score

    # creating a dataframe of metrics
    df_perf = pd.DataFrame(
        {
            "Accuracy": acc,
            "Recall": recall,
            "Precision": precision,
            "F1": f1,
        },
        index=[0],
    )
    return df_perf
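As a sanity check on what the function reports, the four metrics can be derived by hand from the confusion counts of the thresholded predictions. The labels and probabilities below are made up for illustration:

```python
import numpy as np

y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
proba = np.array([0.1, 0.4, 0.6, 0.2, 0.7, 0.9, 0.3, 0.8])
y_pred = (proba > 0.5).astype(int)

# confusion counts
tp = np.sum((y_true == 1) & (y_pred == 1))
fp = np.sum((y_true == 0) & (y_pred == 1))
fn = np.sum((y_true == 1) & (y_pred == 0))
tn = np.sum((y_true == 0) & (y_pred == 0))

accuracy = (tp + tn) / len(y_true)
recall = tp / (tp + fn)      # share of actual buyers we catch
precision = tp / (tp + fp)   # share of flagged customers who actually buy
f1 = 2 * precision * recall / (precision + recall)
print(accuracy, recall, precision, f1)
```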
# defining a function to plot the confusion_matrix of a classification model built using sklearn
def confusion_matrix_sklearn_with_threshold(model, predictors, target, threshold=0.5):
    """
    To plot the confusion_matrix, based on the threshold specified, with percentages

    model: classifier
    predictors: independent variables
    target: dependent variable
    threshold: threshold for classifying the observation as class 1
    """
    pred_prob = model.predict_proba(predictors)[:, 1]  # probability of class 1
    y_pred = (pred_prob > threshold).astype(int)
    cm = confusion_matrix(target, y_pred)
    labels = np.asarray(
        [
            ["{0:0.0f}".format(item) + "\n{0:.2%}".format(item / cm.flatten().sum())]
            for item in cm.flatten()
        ]
    ).reshape(2, 2)
    plt.figure(figsize=(6, 4))
    sns.heatmap(cm, annot=labels, fmt="")
    plt.ylabel("True label")
    plt.xlabel("Predicted label")
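Because the brief calls for Recall to be kept high, it is worth seeing how the `threshold` argument trades precision for recall: lowering it flags more customers as likely buyers and catches more of the true ones. A toy illustration (made-up labels and probabilities):

```python
import numpy as np

y_true = np.array([0, 0, 0, 1, 1, 1])
proba = np.array([0.2, 0.35, 0.6, 0.45, 0.55, 0.9])

def recall_at(threshold):
    pred = (proba > threshold).astype(int)
    tp = np.sum((y_true == 1) & (pred == 1))
    return tp / np.sum(y_true == 1)

# Lowering the threshold from 0.5 to 0.4 picks up the buyer scored at 0.45
print(recall_at(0.5), recall_at(0.4))
```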
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4741 entries, 0 to 4740
Data columns (total 20 columns):
 #   Column                    Non-Null Count  Dtype
---  ------                    --------------  -----
 0   ProdTaken                 4741 non-null   int64
 1   Age                       4741 non-null   float64
 2   TypeofContact             4741 non-null   category
 3   CityTier                  4741 non-null   int64
 4   DurationOfPitch           4741 non-null   float64
 5   Occupation                4741 non-null   category
 6   Gender                    4741 non-null   category
 7   NumberOfPersonVisiting    4741 non-null   int64
 8   NumberOfFollowups         4741 non-null   float64
 9   ProductPitched            4741 non-null   category
 10  PreferredPropertyStar     4741 non-null   float64
 11  MaritalStatus             4741 non-null   category
 12  NumberOfTrips             4741 non-null   float64
 13  Passport                  4741 non-null   int64
 14  PitchSatisfactionScore    4741 non-null   int64
 15  OwnCar                    4741 non-null   int64
 16  NumberOfChildrenVisiting  4741 non-null   float64
 17  Designation               4741 non-null   category
 18  MonthlyIncome             4741 non-null   float64
 19  AgeGroup                  4741 non-null   category
dtypes: category(7), float64(7), int64(6)
memory usage: 515.3 KB
# Dropping the following columns since they will not play a part in determining which customers will purchase the new product
df.drop(["AgeGroup", "ProductPitched", "PreferredPropertyStar"], axis=1, inplace=True)
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4741 entries, 0 to 4740
Data columns (total 17 columns):
 #   Column                    Non-Null Count  Dtype
---  ------                    --------------  -----
 0   ProdTaken                 4741 non-null   int64
 1   Age                       4741 non-null   float64
 2   TypeofContact             4741 non-null   category
 3   CityTier                  4741 non-null   int64
 4   DurationOfPitch           4741 non-null   float64
 5   Occupation                4741 non-null   category
 6   Gender                    4741 non-null   category
 7   NumberOfPersonVisiting    4741 non-null   int64
 8   NumberOfFollowups         4741 non-null   float64
 9   MaritalStatus             4741 non-null   category
 10  NumberOfTrips             4741 non-null   float64
 11  Passport                  4741 non-null   int64
 12  PitchSatisfactionScore    4741 non-null   int64
 13  OwnCar                    4741 non-null   int64
 14  NumberOfChildrenVisiting  4741 non-null   float64
 15  Designation               4741 non-null   category
 16  MonthlyIncome             4741 non-null   float64
dtypes: category(5), float64(6), int64(6)
memory usage: 468.6 KB
X = df.drop(["ProdTaken"], axis=1)
Y = df["ProdTaken"]
X = pd.get_dummies(X, drop_first=True)
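`pd.get_dummies(..., drop_first=True)` one-hot encodes the object/category columns, dropping the first level of each to avoid redundant columns, while numeric columns pass through unchanged. A toy example:

```python
import pandas as pd

toy = pd.DataFrame({"Gender": ["Male", "Female", "Male"], "Passport": [1, 0, 1]})
encoded = pd.get_dummies(toy, drop_first=True)
print(encoded.columns.tolist())
```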
# Splitting data in train and test sets
XX_train, XX_test, YY_train, YY_test = train_test_split(
X, Y, test_size=0.30, random_state=1,
)
print("Shape of X Training set : ", XX_train.shape)
print("Shape of X test set : ", XX_test.shape)
print("Shape of Y Training set : ", YY_train.shape)
print("Shape of Y test set : ", YY_test.shape)
print("Percentage of classes in training set:")
print(YY_train.value_counts(normalize=True))
print("Percentage of classes in test set:")
print(YY_test.value_counts(normalize=True))
Shape of X Training set :  (3318, 23)
Shape of X test set :  (1423, 23)
Shape of Y Training set :  (3318,)
Shape of Y test set :  (1423,)
Percentage of classes in training set:
0    0.806811
1    0.193189
Name: ProdTaken, dtype: float64
Percentage of classes in test set:
0    0.823612
1    0.176388
Name: ProdTaken, dtype: float64
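The helper `model_performance_classification_sklearn_with_threshold` used throughout this section is defined earlier in the notebook. As a rough sketch of what such a helper might look like (an assumption, not the notebook's actual definition: it scores hard predictions made at a probability threshold and returns a one-row DataFrame of Accuracy/Recall/Precision/F1):

```python
import pandas as pd
from sklearn import metrics

def model_performance_classification_sklearn_with_threshold(
    model, predictors, target, threshold=0.5
):
    """Sketch: score predictions made at `threshold` and return a
    one-row DataFrame of Accuracy/Recall/Precision/F1."""
    # Probability of the positive class, thresholded into hard labels.
    pred = model.predict_proba(predictors)[:, 1] > threshold
    return pd.DataFrame(
        {
            "Accuracy": metrics.accuracy_score(target, pred),
            "Recall": metrics.recall_score(target, pred),
            "Precision": metrics.precision_score(target, pred),
            "F1": metrics.f1_score(target, pred),
        },
        index=[0],
    )
```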
# base_estimator for bagging classifier is a decision tree by default
bagging_classifier = BaggingClassifier(random_state=1)
bagging_classifier.fit(XX_train, YY_train)
# Calculating different metrics
bagging_classifier_model_train_perf = model_performance_classification_sklearn_with_threshold(
bagging_classifier, XX_train, YY_train
)
print("Training performance:\n", bagging_classifier_model_train_perf)
bagging_classifier_model_test_perf = model_performance_classification_sklearn_with_threshold(
bagging_classifier, XX_test, YY_test
)
print("Testing performance:\n", bagging_classifier_model_test_perf)
# Creating confusion matrix
confusion_matrix_sklearn_with_threshold(bagging_classifier, XX_test, YY_test)
Training performance:
Accuracy Recall Precision F1
0 0.990054 0.950078 0.998361 0.973621
Testing performance:
Accuracy Recall Precision F1
0 0.911455 0.569721 0.888199 0.694175
Observations:
# Fitting the model
rf_classifier = RandomForestClassifier(random_state=1)
rf_classifier.fit(XX_train, YY_train)
# Calculating different metrics
rf_classifier_model_train_perf = model_performance_classification_sklearn_with_threshold(
rf_classifier, XX_train, YY_train
)
print("Training performance:\n", rf_classifier_model_train_perf)
rf_classifier_model_test_perf = model_performance_classification_sklearn_with_threshold(
rf_classifier, XX_test, YY_test
)
print("Testing performance:\n", rf_classifier_model_test_perf)
# Creating confusion matrix
confusion_matrix_sklearn_with_threshold(rf_classifier, XX_test, YY_test)
Training performance:
Accuracy Recall Precision F1
0 1.0 1.0 1.0 1.0
Testing performance:
Accuracy Recall Precision F1
0 0.898805 0.505976 0.863946 0.638191
feature_names = XX_train.columns
importances = rf_classifier.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(12, 12))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
Observations:
# Fitting the model
dt_classifier = DecisionTreeClassifier(random_state=1)
dt_classifier.fit(XX_train, YY_train)
# Calculating different metrics
dt_classifier_model_train_perf = model_performance_classification_sklearn_with_threshold(
dt_classifier, XX_train, YY_train
)
print("Training performance:\n", dt_classifier_model_train_perf)
dt_classifier_model_test_perf = model_performance_classification_sklearn_with_threshold(
dt_classifier, XX_test, YY_test
)
print("Testing performance:\n", dt_classifier_model_test_perf)
# Creating confusion matrix
confusion_matrix_sklearn_with_threshold(dt_classifier, XX_test, YY_test)
Training performance:
Accuracy Recall Precision F1
0 1.0 1.0 1.0 1.0
Testing performance:
Accuracy Recall Precision F1
0 0.882642 0.701195 0.656716 0.678227
feature_names = XX_train.columns
importances = dt_classifier.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(12, 12))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
Observations:
# Choose the type of classifier.
bagging_estimator_tuned = BaggingClassifier(random_state=1)
cl1 = DecisionTreeClassifier(random_state=1)
# Grid of parameters to choose from
parameters = {
"base_estimator": [cl1],
"max_samples": [0.7, 0.8, 0.9, 1],
"max_features": [0.7, 0.8, 0.9, 1],
"n_estimators": [10, 20, 30, 40, 50],
}
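For a sense of the search cost: the grid above expands to 4 × 4 × 5 candidates (× 1 for the single `base_estimator` entry) = 80, each fit 5 times under `cv=5`. A quick way to count combinations (a sketch using `ParameterGrid`; the one-element `base_estimator` list is omitted since it multiplies by 1):

```python
from sklearn.model_selection import ParameterGrid

parameters = {
    "max_samples": [0.7, 0.8, 0.9, 1],
    "max_features": [0.7, 0.8, 0.9, 1],
    "n_estimators": [10, 20, 30, 40, 50],
}
# ParameterGrid enumerates the Cartesian product of the value lists.
n_combos = len(ParameterGrid(parameters))
print(n_combos, "candidates ->", n_combos * 5, "fits with cv=5")  # 80 -> 400
```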
# Type of scoring used to compare parameter combinations
acc_scorer = metrics.make_scorer(metrics.f1_score)
# Run the grid search
grid_obj = GridSearchCV(
bagging_estimator_tuned,
parameters,
scoring=acc_scorer,
cv=5,
n_jobs=5,
return_train_score=True,
)
grid_obj = grid_obj.fit(XX_train, YY_train)
# Set the clf to the best combination of parameters
bagging_estimator_tuned = grid_obj.best_estimator_
# Fit the best algorithm to the data.
bagging_estimator_tuned.fit(XX_train, YY_train)
BaggingClassifier(base_estimator=DecisionTreeClassifier(random_state=1),
max_features=0.8, max_samples=0.9, n_estimators=50,
random_state=1)
# Calculating different metrics
bagging_estimator_tuned_model_train_perf = model_performance_classification_sklearn_with_threshold(
bagging_estimator_tuned, XX_train, YY_train
)
print("Training performance:\n", bagging_estimator_tuned_model_train_perf)
bagging_estimator_tuned_model_test_perf = model_performance_classification_sklearn_with_threshold(
bagging_estimator_tuned, XX_test, YY_test
)
print("Testing performance:\n", bagging_estimator_tuned_model_test_perf)
# Creating confusion matrix
confusion_matrix_sklearn_with_threshold(bagging_estimator_tuned, XX_test, YY_test)
Training performance:
Accuracy Recall Precision F1
0 0.999397 0.99688 1.0 0.998437
Testing performance:
Accuracy Recall Precision F1
0 0.917779 0.589641 0.91358 0.716707
Observations:
# Choose the type of classifier.
bagging_estimator_dTree = BaggingClassifier(
base_estimator=DecisionTreeClassifier(
criterion="gini", class_weight={0: 0.19, 1: 0.81}, random_state=1
),
random_state=1,
)
# Grid of parameters to choose from
parameters = {
"max_samples": [0.7, 0.8, 0.9, 1],
"max_features": [0.7, 0.8, 0.9, 1],
"n_estimators": [10, 20, 30, 40, 50],
}
# Type of scoring used to compare parameter combinations
acc_scorer = metrics.make_scorer(metrics.f1_score)
# Run the grid search
grid_obj = GridSearchCV(
bagging_estimator_dTree,
parameters,
scoring=acc_scorer,
cv=5,
n_jobs=5,
return_train_score=True,
)
grid_obj = grid_obj.fit(XX_train, YY_train)
# Set the clf to the best combination of parameters
bagging_estimator_dTree = grid_obj.best_estimator_
# Fit the best algorithm to the data.
bagging_estimator_dTree.fit(XX_train, YY_train)
BaggingClassifier(base_estimator=DecisionTreeClassifier(class_weight={0: 0.19,
1: 0.81},
random_state=1),
max_features=0.9, max_samples=0.9, n_estimators=50,
random_state=1)
# Calculating different metrics
bagging_estimator_dTree_model_train_perf = model_performance_classification_sklearn_with_threshold(
bagging_estimator_dTree, XX_train, YY_train
)
print("Training performance:\n", bagging_estimator_dTree_model_train_perf)
bagging_estimator_dTree_model_test_perf = model_performance_classification_sklearn_with_threshold(
bagging_estimator_dTree, XX_test, YY_test
)
print("Testing performance:\n", bagging_estimator_dTree_model_test_perf)
# Creating confusion matrix
confusion_matrix_sklearn_with_threshold(bagging_estimator_dTree, XX_test, YY_test)
Training performance:
Accuracy Recall Precision F1
0 0.998794 0.99376 1.0 0.99687
Testing performance:
Accuracy Recall Precision F1
0 0.91286 0.573705 0.89441 0.699029
Observations:
# Choose the type of classifier.
rf_tuned = RandomForestClassifier(class_weight={0: 0.20, 1: 0.80}, random_state=1)
parameters = {
"max_depth": list(np.arange(3, 10)),
"n_estimators": [10, 20, 30, 40, 50],
"max_samples": [0.7, 0.8, 0.9, 1],
"max_features": [0.7, 0.8, 0.9, 1],
}
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.f1_score)
# Run the grid search
grid_obj = GridSearchCV(
rf_tuned, parameters, scoring=scorer, cv=5, n_jobs=5, return_train_score=True
)
grid_obj = grid_obj.fit(XX_train, YY_train)
# Set the clf to the best combination of parameters
rf_tuned = grid_obj.best_estimator_
# Fit the best algorithm to the data.
rf_tuned.fit(XX_train, YY_train)
RandomForestClassifier(class_weight={0: 0.2, 1: 0.8}, max_depth=9,
max_features=0.8, max_samples=0.8, n_estimators=20,
random_state=1)
# Calculating different metrics
rf_tuned_model_train_perf = model_performance_classification_sklearn_with_threshold(
rf_tuned, XX_train, YY_train
)
print("Training performance:\n", rf_tuned_model_train_perf)
rf_tuned_model_test_perf = model_performance_classification_sklearn_with_threshold(
rf_tuned, XX_test, YY_test
)
print("Testing performance:\n", rf_tuned_model_test_perf)
# Creating confusion matrix
confusion_matrix_sklearn_with_threshold(rf_tuned, XX_test, YY_test)
Training performance:
Accuracy Recall Precision F1
0 0.957806 0.878315 0.9008 0.889415
Testing performance:
Accuracy Recall Precision F1
0 0.888264 0.621514 0.709091 0.66242
feature_names = XX_train.columns
importances = rf_tuned.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(12, 12))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
Observations:
# Choose the type of classifier.
dtree_estimator_tuned = DecisionTreeClassifier(
class_weight={0: 0.20, 1: 0.80}, random_state=1, criterion="entropy"
)
# Grid of parameters to choose from
parameters = {
"max_depth": np.arange(2, 10),
"min_samples_leaf": [5, 7, 10, 15],
"max_leaf_nodes": [2, 3, 5, 10, 15],
"min_impurity_decrease": [0.0001, 0.001, 0.01, 0.1],
}
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.f1_score)
# Run the grid search
grid_obj = GridSearchCV(
dtree_estimator_tuned,
parameters,
scoring=scorer,
cv=5,
n_jobs=5,
return_train_score=True,
)
grid_obj = grid_obj.fit(XX_train, YY_train)
# Set the clf to the best combination of parameters
dtree_estimator_tuned = grid_obj.best_estimator_
# Fit the best algorithm to the data.
dtree_estimator_tuned.fit(XX_train, YY_train)
DecisionTreeClassifier(class_weight={0: 0.2, 1: 0.8}, criterion='entropy',
max_depth=6, max_leaf_nodes=15,
min_impurity_decrease=0.0001, min_samples_leaf=5,
random_state=1)
# Calculating different metrics
dtree_estimator_model_train_perf = model_performance_classification_sklearn_with_threshold(
dtree_estimator_tuned, XX_train, YY_train
)
print("Training performance:\n", dtree_estimator_model_train_perf)
dtree_estimator_model_test_perf = model_performance_classification_sklearn_with_threshold(
dtree_estimator_tuned, XX_test, YY_test
)
print("Testing performance:\n", dtree_estimator_model_test_perf)
# Creating confusion matrix
confusion_matrix_sklearn_with_threshold(dtree_estimator_tuned, XX_test, YY_test)
Training performance:
Accuracy Recall Precision F1
0 0.751356 0.75195 0.419861 0.538849
Testing performance:
Accuracy Recall Precision F1
0 0.724526 0.681275 0.354037 0.46594
feature_names = XX_train.columns
importances = dtree_estimator_tuned.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(12, 12))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
Observations:
# training performance comparison
baggingtech_models_train_comp_df = pd.concat(
[
bagging_classifier_model_train_perf.T,
bagging_estimator_tuned_model_train_perf.T,
bagging_estimator_dTree_model_train_perf.T,
rf_classifier_model_train_perf.T,
rf_tuned_model_train_perf.T,
dt_classifier_model_train_perf.T,
dtree_estimator_model_train_perf.T,
],
axis=1,
)
baggingtech_models_train_comp_df.columns = [
"Bagging",
"Bagging Tuned",
"Bagging Weighted DTree",
"Random Forest",
"Random Forest Tuned",
"Decision Tree",
"Decision Tree Tuned"
]
# testing performance comparison
baggingtech_models_test_comp_df = pd.concat(
[
bagging_classifier_model_test_perf.T,
bagging_estimator_tuned_model_test_perf.T,
bagging_estimator_dTree_model_test_perf.T,
rf_classifier_model_test_perf.T,
rf_tuned_model_test_perf.T,
dt_classifier_model_test_perf.T,
dtree_estimator_model_test_perf.T,
],
axis=1,
)
baggingtech_models_test_comp_df.columns = [
"Bagging",
"Bagging Tuned",
"Bagging Weighted DTree",
"Random Forest",
"Random Forest Tuned",
"Decision Tree",
"Decision Tree Tuned"
]
print("Bagging Technique: Training performance comparison:")
baggingtech_models_train_comp_df
Bagging Technique: Training performance comparison:
|  | Bagging | Bagging Tuned | Bagging Weighted DTree | Random Forest | Random Forest Tuned | Decision Tree | Decision Tree Tuned |
|---|---|---|---|---|---|---|---|
| Accuracy | 0.990054 | 0.999397 | 0.998794 | 1.0 | 0.957806 | 1.0 | 0.751356 |
| Recall | 0.950078 | 0.996880 | 0.993760 | 1.0 | 0.878315 | 1.0 | 0.751950 |
| Precision | 0.998361 | 1.000000 | 1.000000 | 1.0 | 0.900800 | 1.0 | 0.419861 |
| F1 | 0.973621 | 0.998437 | 0.996870 | 1.0 | 0.889415 | 1.0 | 0.538849 |
print("Bagging Technique: Test set performance comparison:")
baggingtech_models_test_comp_df
Bagging Technique: Test set performance comparison:
|  | Bagging | Bagging Tuned | Bagging Weighted DTree | Random Forest | Random Forest Tuned | Decision Tree | Decision Tree Tuned |
|---|---|---|---|---|---|---|---|
| Accuracy | 0.911455 | 0.917779 | 0.912860 | 0.898805 | 0.888264 | 0.882642 | 0.724526 |
| Recall | 0.569721 | 0.589641 | 0.573705 | 0.505976 | 0.621514 | 0.701195 | 0.681275 |
| Precision | 0.888199 | 0.913580 | 0.894410 | 0.863946 | 0.709091 | 0.656716 | 0.354037 |
| F1 | 0.694175 | 0.716707 | 0.699029 | 0.638191 | 0.662420 | 0.678227 | 0.465940 |
The tuned Bagging model gives the best test F1 among these models and could be used by the travel agency to identify customers likely to purchase the new product. However, its training F1 of approx. 99.8% against a test F1 of approx. 71.7% shows it is overfitting the training data.
The Bagging model with a class-weighted Decision Tree also has a high training F1 but a lower test F1, and it likewise overfits.
The tuned Random Forest has the next best F1 score among the remaining models and gives a more generalized performance than the bagging models.
From the feature importances, Passport, MonthlyIncome, Age, DurationOfPitch, Designation, NumberOfTrips, NumberOfFollowups and CityTier play an important part in identifying potential customers.
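Since the `_with_threshold` helpers suggest the probability cutoff is adjustable, the precision/recall trade-off can be tuned by scanning thresholds. A self-contained sketch on toy data (illustrative only; it stands in for applying the same scan to `rf_tuned` on a validation split):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_recall_curve

# Toy imbalanced data standing in for the notebook's train split.
X_toy, y_toy = make_classification(n_samples=500, weights=[0.8, 0.2], random_state=1)
model = RandomForestClassifier(random_state=1).fit(X_toy, y_toy)

# Scan candidate thresholds and pick the one maximizing F1.
probs = model.predict_proba(X_toy)[:, 1]
precision, recall, thresholds = precision_recall_curve(y_toy, probs)
f1 = 2 * precision * recall / (precision + recall + 1e-12)
best = thresholds[f1[:-1].argmax()]  # f1 has one extra trailing entry
print("Best F1 threshold:", best)
```

In practice the scan should use a held-out validation set rather than training data, which a random forest fits almost perfectly.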
feature_names = XX_train.columns
importances = rf_tuned.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(12, 12))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
# Fitting the model
ab_classifier = AdaBoostClassifier(random_state=1)
ab_classifier.fit(XX_train, YY_train)
# Calculating different metrics
ab_classifier_model_train_perf = model_performance_classification_sklearn_with_threshold(
ab_classifier, XX_train, YY_train
)
print("Training performance:\n", ab_classifier_model_train_perf)
ab_classifier_model_test_perf = model_performance_classification_sklearn_with_threshold(
ab_classifier, XX_test, YY_test
)
print("Testing performance:\n", ab_classifier_model_test_perf)
# Creating confusion matrix
confusion_matrix_sklearn_with_threshold(ab_classifier, XX_test, YY_test)
Training performance:
Accuracy Recall Precision F1
0 0.848704 0.372855 0.705015 0.487755
Testing performance:
Accuracy Recall Precision F1
0 0.847505 0.302789 0.644068 0.411924
feature_names = XX_train.columns
importances = ab_classifier.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(12, 12))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
Observations:
# Fitting the model
gb_classifier = GradientBoostingClassifier(random_state=1)
gb_classifier.fit(XX_train, YY_train)
# Calculating different metrics
gb_classifier_model_train_perf = model_performance_classification_sklearn_with_threshold(
gb_classifier, XX_train, YY_train
)
print("Training performance:\n", gb_classifier_model_train_perf)
gb_classifier_model_test_perf = model_performance_classification_sklearn_with_threshold(
gb_classifier, XX_test, YY_test
)
print("Testing performance:\n", gb_classifier_model_test_perf)
# Creating confusion matrix
confusion_matrix_sklearn_with_threshold(gb_classifier, XX_test, YY_test)
Training performance:
Accuracy Recall Precision F1
0 0.88909 0.482059 0.895652 0.626775
Testing performance:
Accuracy Recall Precision F1
0 0.878426 0.414343 0.8 0.545932
feature_names = XX_train.columns
importances = gb_classifier.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(12, 12))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
Observations:
# Fitting the model
xgb_classifier = XGBClassifier(random_state=1, eval_metric="logloss")
xgb_classifier.fit(XX_train, YY_train)
# Calculating different metrics
xgb_classifier_model_train_perf = model_performance_classification_sklearn_with_threshold(
xgb_classifier, XX_train, YY_train
)
print("Training performance:\n", xgb_classifier_model_train_perf)
xgb_classifier_model_test_perf = model_performance_classification_sklearn_with_threshold(
xgb_classifier, XX_test, YY_test
)
print("Testing performance:\n", xgb_classifier_model_test_perf)
# Creating confusion matrix
confusion_matrix_sklearn_with_threshold(xgb_classifier, XX_test, YY_test)
Training performance:
Accuracy Recall Precision F1
0 0.999699 0.99844 1.0 0.999219
Testing performance:
Accuracy Recall Precision F1
0 0.923401 0.657371 0.87766 0.751708
feature_names = XX_train.columns
importances = xgb_classifier.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(12, 12))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
Observations:
# Choose the type of classifier.
abc_tuned = AdaBoostClassifier(random_state=1)
# Grid of parameters to choose from
parameters = {
# Let's try different max_depth for base_estimator
"base_estimator": [
DecisionTreeClassifier(max_depth=1),
DecisionTreeClassifier(max_depth=2),
DecisionTreeClassifier(max_depth=3),
],
"n_estimators": np.arange(10, 110, 10),
"learning_rate": np.arange(0.1, 2, 0.1),
}
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.f1_score)
# Run the grid search
grid_obj = GridSearchCV(
abc_tuned, parameters, scoring=scorer, cv=5, n_jobs=5, return_train_score=True,
)
grid_obj = grid_obj.fit(XX_train, YY_train)
# Set the clf to the best combination of parameters
abc_tuned = grid_obj.best_estimator_
# Fit the best algorithm to the data.
abc_tuned.fit(XX_train, YY_train)
# Calculating different metrics
abc_tuned_model_train_perf = model_performance_classification_sklearn_with_threshold(
abc_tuned, XX_train, YY_train
)
print("Training performance:\n", abc_tuned_model_train_perf)
abc_tuned_model_test_perf = model_performance_classification_sklearn_with_threshold(
abc_tuned, XX_test, YY_test
)
print("Testing performance:\n", abc_tuned_model_test_perf)
# Creating confusion matrix
confusion_matrix_sklearn_with_threshold(abc_tuned, XX_test, YY_test)
Training performance:
Accuracy Recall Precision F1
0 0.985835 0.946958 0.979032 0.962728
Testing performance:
Accuracy Recall Precision F1
0 0.860155 0.609562 0.602362 0.605941
feature_names = XX_train.columns
importances = abc_tuned.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(12, 12))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
Observations:
# Choose the type of classifier.
gbc_tuned = GradientBoostingClassifier(
init=AdaBoostClassifier(random_state=1), random_state=1
)
# Grid of parameters to choose from
parameters = {
"n_estimators": [100, 150, 200, 250],
"subsample": [0.8, 0.9, 1],
"max_features": [0.7, 0.8, 0.9, 1],
}
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.f1_score)
# Run the grid search
grid_obj = GridSearchCV(
gbc_tuned, parameters, scoring=scorer, cv=5, n_jobs=5, return_train_score=True
)
grid_obj = grid_obj.fit(XX_train, YY_train)
# Set the clf to the best combination of parameters
gbc_tuned = grid_obj.best_estimator_
# Fit the best algorithm to the data.
gbc_tuned.fit(XX_train, YY_train)
# Calculating different metrics
gbc_tuned_model_train_perf = model_performance_classification_sklearn_with_threshold(
gbc_tuned, XX_train, YY_train
)
print("Training performance:\n", gbc_tuned_model_train_perf)
gbc_tuned_model_test_perf = model_performance_classification_sklearn_with_threshold(
gbc_tuned, XX_test, YY_test
)
print("Testing performance:\n", gbc_tuned_model_test_perf)
# Creating confusion matrix
confusion_matrix_sklearn_with_threshold(gbc_tuned, XX_test, YY_test)
Training performance:
Accuracy Recall Precision F1
0 0.927667 0.652106 0.96092 0.776952
Testing performance:
Accuracy Recall Precision F1
0 0.885453 0.47012 0.797297 0.591479
feature_names = XX_train.columns
importances = gbc_tuned.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(12, 12))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
Observations:
# Choose the type of classifier.
xgb_tuned = XGBClassifier(random_state=1, eval_metric="logloss")
# Grid of parameters to choose from
parameters = {
"n_estimators": [10, 30, 50],
"scale_pos_weight": [1, 2, 5],
"subsample": [0.7, 0.9, 1],
"learning_rate": [0.05, 0.1, 0.2],
"colsample_bytree": [0.7, 0.9, 1],
"colsample_bylevel": [0.5, 0.7, 1],
}
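On `scale_pos_weight`: a common starting point for this XGBoost parameter is the negative-to-positive ratio of the training set. Using the class proportions printed earlier (~80.7% vs ~19.3%), a quick back-of-the-envelope (a sketch; the grid above deliberately tries the smaller values 1, 2 and 5):

```python
# Negative-to-positive ratio from the training-set class proportions
# printed earlier in the notebook (~0.807 vs ~0.193).
p_pos = 0.193189
ratio = (1 - p_pos) / p_pos
print(round(ratio, 2))  # 4.18
```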
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.f1_score)
# Run the grid search
grid_obj = GridSearchCV(xgb_tuned, parameters, scoring=scorer, cv=5)
grid_obj = grid_obj.fit(XX_train, YY_train)
# Set the clf to the best combination of parameters
xgb_tuned = grid_obj.best_estimator_
# Fit the best algorithm to the data.
xgb_tuned.fit(XX_train, YY_train)
# Calculating different metrics
xgb_tuned_model_train_perf = model_performance_classification_sklearn_with_threshold(
xgb_tuned, XX_train, YY_train
)
print("Training performance:\n", xgb_tuned_model_train_perf)
xgb_tuned_model_test_perf = model_performance_classification_sklearn_with_threshold(
xgb_tuned, XX_test, YY_test
)
print("Testing performance:\n", xgb_tuned_model_test_perf)
# Creating confusion matrix
confusion_matrix_sklearn_with_threshold(xgb_tuned, XX_test, YY_test)
Training performance:
Accuracy Recall Precision F1
0 0.971971 0.99064 0.879501 0.931768
Testing performance:
Accuracy Recall Precision F1
0 0.894589 0.808765 0.665574 0.730216
feature_names = XX_train.columns
importances = xgb_tuned.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(12, 12))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
Observations:
# training performance comparison
boostingtech_models_train_comp_df = pd.concat(
[
ab_classifier_model_train_perf.T,
abc_tuned_model_train_perf.T,
gb_classifier_model_train_perf.T,
gbc_tuned_model_train_perf.T,
xgb_classifier_model_train_perf.T,
xgb_tuned_model_train_perf.T,
],
axis=1,
)
boostingtech_models_train_comp_df.columns = [
"AdaBoost",
"AdaBoost Tuned",
"Gradient",
"Gradient Tuned",
"XGBoost",
"XGBoost Tuned"
]
# testing performance comparison
boostingtech_models_test_comp_df = pd.concat(
[
ab_classifier_model_test_perf.T,
abc_tuned_model_test_perf.T,
gb_classifier_model_test_perf.T,
gbc_tuned_model_test_perf.T,
xgb_classifier_model_test_perf.T,
xgb_tuned_model_test_perf.T,
],
axis=1,
)
boostingtech_models_test_comp_df.columns = [
"AdaBoost",
"AdaBoost Tuned",
"Gradient",
"Gradient Tuned",
"XGBoost",
"XGBoost Tuned"
]
print("Boosting Technique: Training performance comparison:")
boostingtech_models_train_comp_df
Boosting Technique: Training performance comparison:
|  | AdaBoost | AdaBoost Tuned | Gradient | Gradient Tuned | XGBoost | XGBoost Tuned |
|---|---|---|---|---|---|---|
| Accuracy | 0.848704 | 0.985835 | 0.889090 | 0.927667 | 0.999699 | 0.971971 |
| Recall | 0.372855 | 0.946958 | 0.482059 | 0.652106 | 0.998440 | 0.990640 |
| Precision | 0.705015 | 0.979032 | 0.895652 | 0.960920 | 1.000000 | 0.879501 |
| F1 | 0.487755 | 0.962728 | 0.626775 | 0.776952 | 0.999219 | 0.931768 |
print("Boosting Technique: Test set performance comparison:")
boostingtech_models_test_comp_df
Boosting Technique: Test set performance comparison:
|  | AdaBoost | AdaBoost Tuned | Gradient | Gradient Tuned | XGBoost | XGBoost Tuned |
|---|---|---|---|---|---|---|
| Accuracy | 0.847505 | 0.860155 | 0.878426 | 0.885453 | 0.923401 | 0.894589 |
| Recall | 0.302789 | 0.609562 | 0.414343 | 0.470120 | 0.657371 | 0.808765 |
| Precision | 0.644068 | 0.602362 | 0.800000 | 0.797297 | 0.877660 | 0.665574 |
| F1 | 0.411924 | 0.605941 | 0.545932 | 0.591479 | 0.751708 | 0.730216 |
Among the boosting techniques, the XGBoost model gives the best test F1 and could be used by the travel agency to identify customers likely to purchase the new product. It has the highest F1 score (approx. 99.9%) on the training data, but the drop to its test F1 shows it is overfitting.
The tuned AdaBoost model has the next best training F1 (approx. 96.3%) but shows a large gap between its training and test F1 scores.
The tuned XGBoost model has the next best training F1 and gives a more generalized performance than the other boosting models.
The tuned Gradient Boosting model is the next best fit, with a training F1 of approx. 77.7%, and generalizes well across the training and test data.
From the feature importances, Passport, Designation_Exec, MaritalStatus_Single, Occupation, CityTier, MaritalStatus_Married and Designation_SM play an important part in identifying potential customers.
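The rankings across both comparison tables can also be pulled programmatically. A small sketch using the test-set F1 values reported above:

```python
import pandas as pd

# Test-set F1 scores collected from the comparison tables above.
test_f1 = pd.Series(
    {
        "Bagging": 0.694175,
        "Bagging Tuned": 0.716707,
        "Bagging Weighted DTree": 0.699029,
        "Random Forest": 0.638191,
        "Random Forest Tuned": 0.662420,
        "Decision Tree": 0.678227,
        "Decision Tree Tuned": 0.465940,
        "AdaBoost": 0.411924,
        "AdaBoost Tuned": 0.605941,
        "Gradient": 0.545932,
        "Gradient Tuned": 0.591479,
        "XGBoost": 0.751708,
        "XGBoost Tuned": 0.730216,
    }
)
# Top three models by test F1.
print(test_f1.sort_values(ascending=False).head(3))
```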
feature_names = XX_train.columns
importances = xgb_tuned.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(12, 12))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
estimators = [
("Bagging Tuned", bagging_estimator_tuned),
("Gradient Boosting", gb_classifier),
("Decision Tree Tuned", dtree_estimator_tuned),
]
final_estimator = rf_tuned
stacking_classifier_BagT_Grad_DTreeT_RFT = StackingClassifier(
estimators=estimators, final_estimator=final_estimator
)
stacking_classifier_BagT_Grad_DTreeT_RFT.fit(XX_train, YY_train)
StackingClassifier(estimators=[('Bagging Tuned',
BaggingClassifier(base_estimator=DecisionTreeClassifier(random_state=1),
max_features=0.8,
max_samples=0.9,
n_estimators=50,
random_state=1)),
('Gradient Boosting',
GradientBoostingClassifier(random_state=1)),
('Decision Tree Tuned',
DecisionTreeClassifier(class_weight={0: 0.2,
1: 0.8},
criterion='entropy',
max_depth=6,
max_leaf_nodes=15,
min_impurity_decrease=0.0001,
min_samples_leaf=5,
random_state=1))],
final_estimator=RandomForestClassifier(class_weight={0: 0.2,
1: 0.8},
max_depth=9,
max_features=0.8,
max_samples=0.8,
n_estimators=20,
random_state=1))
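By default, `StackingClassifier` trains the final estimator on out-of-fold predictions of the base estimators (5-fold internally), so the meta-model never learns from predictions made on data a base model was trained on. A minimal self-contained sketch on toy data (the estimators here are illustrative stand-ins, not the notebook's tuned models):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

X_toy, y_toy = make_classification(n_samples=300, random_state=1)

# Base estimators produce out-of-fold predictions (cv=5 by default);
# the final estimator is trained on those predictions, not raw features.
stack = StackingClassifier(
    estimators=[
        ("tree", DecisionTreeClassifier(random_state=1)),
        ("logit", LogisticRegression(max_iter=1000)),
    ],
    final_estimator=RandomForestClassifier(random_state=1),
)
stack.fit(X_toy, y_toy)
print("Training accuracy:", stack.score(X_toy, y_toy))
```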
# Calculating different metrics
stacking_classifier_BagT_Grad_DTreeT_RFT_Train = model_performance_classification_sklearn_with_threshold(
stacking_classifier_BagT_Grad_DTreeT_RFT, XX_train, YY_train
)
print("Training performance:\n", stacking_classifier_BagT_Grad_DTreeT_RFT_Train)
stacking_classifier_BagT_Grad_DTreeT_RFT_Test = model_performance_classification_sklearn_with_threshold(
stacking_classifier_BagT_Grad_DTreeT_RFT, XX_test, YY_test
)
print("Testing performance:\n", stacking_classifier_BagT_Grad_DTreeT_RFT_Test)
# Creating confusion matrix
confusion_matrix_sklearn_with_threshold(
stacking_classifier_BagT_Grad_DTreeT_RFT, XX_test, YY_test
)
Training performance:
Accuracy Recall Precision F1
0 0.99789 0.99532 0.993769 0.994544
Testing performance:
Accuracy Recall Precision F1
0 0.907238 0.760956 0.726236 0.743191
Observations:
Stacking Model - Base estimators (Bagging Tuned, Gradient Boosting, Decision Tree Tuned) & Final estimator (Random Forest Tuned)
estimators = [
("AdaBoost Tuned", abc_tuned),
("Gradient Tuned", gbc_tuned),
("Decision Tree", dt_classifier),
]
final_estimator = xgb_tuned
stacking_classifier_AdaT_GradT_DTree_XGBT = StackingClassifier(
estimators=estimators, final_estimator=final_estimator
)
stacking_classifier_AdaT_GradT_DTree_XGBT.fit(XX_train, YY_train)
StackingClassifier(estimators=[('AdaBoost Tuned',
AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_depth=3),
learning_rate=1.5000000000000002,
n_estimators=100,
random_state=1)),
('Gradient Tuned',
GradientBoostingClassifier(init=AdaBoostClassifier(random_state=1),
max_features=0.8,
n_estimators=250,
random_state=1,
subsample=0.8)),
('Decision Tree',...
eval_metric='logloss', gamma=0,
gpu_id=-1,
grow_policy='depthwise',
importance_type=None,
interaction_constraints='',
learning_rate=0.2, max_bin=256,
max_cat_to_onehot=4,
max_delta_step=0, max_depth=6,
max_leaves=0,
min_child_weight=1,
missing=nan,
monotone_constraints='()',
n_estimators=50, n_jobs=0,
num_parallel_tree=1,
predictor='auto',
random_state=1, reg_alpha=0,
reg_lambda=1, ...))
# Calculating different metrics
stacking_classifier_AdaT_GradT_DTree_XGBT_Train = model_performance_classification_sklearn_with_threshold(
stacking_classifier_AdaT_GradT_DTree_XGBT, XX_train, YY_train
)
print("Training performance:\n", stacking_classifier_AdaT_GradT_DTree_XGBT_Train)
stacking_classifier_AdaT_GradT_DTree_XGBT_Test = model_performance_classification_sklearn_with_threshold(
stacking_classifier_AdaT_GradT_DTree_XGBT, XX_test, YY_test
)
print("Testing performance:\n", stacking_classifier_AdaT_GradT_DTree_XGBT_Test)
# Creating confusion matrix
confusion_matrix_sklearn_with_threshold(
stacking_classifier_AdaT_GradT_DTree_XGBT, XX_test, YY_test
)
Training performance:
Accuracy Recall Precision F1
0 0.959916 1.0 0.828165 0.906007
Testing performance:
Accuracy Recall Precision F1
0 0.853127 0.808765 0.557692 0.660163
Observations:
estimators = [
("Bagging Weighted DTree", bagging_estimator_dTree),
("AdaBoost Tuned", abc_tuned),
("Random Forest Tuned", rf_tuned),
("Decision Tree Tuned", dtree_estimator_tuned),
]
final_estimator = xgb_tuned
stacking_classifier_BagDtree_AdaT_RFT_DTreeT_XGBT = StackingClassifier(
estimators=estimators, final_estimator=final_estimator
)
stacking_classifier_BagDtree_AdaT_RFT_DTreeT_XGBT.fit(XX_train, YY_train)
StackingClassifier(estimators=[('Bagging Weighted DTree',
BaggingClassifier(base_estimator=DecisionTreeClassifier(class_weight={0: 0.19,
1: 0.81},
random_state=1),
max_features=0.9,
max_samples=0.9,
n_estimators=50,
random_state=1)),
('AdaBoost Tuned',
AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_depth=3),
learning_rate=1.5000000000000002,
n_estima...
eval_metric='logloss', gamma=0,
gpu_id=-1,
grow_policy='depthwise',
importance_type=None,
interaction_constraints='',
learning_rate=0.2, max_bin=256,
max_cat_to_onehot=4,
max_delta_step=0, max_depth=6,
max_leaves=0,
min_child_weight=1,
missing=nan,
monotone_constraints='()',
n_estimators=50, n_jobs=0,
num_parallel_tree=1,
predictor='auto',
random_state=1, reg_alpha=0,
reg_lambda=1, ...))
# Calculating different metrics
stacking_classifier_BagDtree_AdaT_RFT_DTreeT_XGBT_Train = model_performance_classification_sklearn_with_threshold(
stacking_classifier_BagDtree_AdaT_RFT_DTreeT_XGBT, XX_train, YY_train
)
print(
"Training performance:\n", stacking_classifier_BagDtree_AdaT_RFT_DTreeT_XGBT_Train
)
stacking_classifier_BagDtree_AdaT_RFT_DTreeT_XGBT_Test = model_performance_classification_sklearn_with_threshold(
stacking_classifier_BagDtree_AdaT_RFT_DTreeT_XGBT, XX_test, YY_test
)
print("Testing performance:\n", stacking_classifier_BagDtree_AdaT_RFT_DTreeT_XGBT_Test)
# Creating confusion matrix
confusion_matrix_sklearn_with_threshold(
stacking_classifier_BagDtree_AdaT_RFT_DTreeT_XGBT, XX_test, YY_test
)
Training performance:
Accuracy Recall Precision F1
0 0.991863 1.0 0.959581 0.979374
Testing performance:
Accuracy Recall Precision F1
0 0.888967 0.856574 0.637982 0.731293
Observations:
- Training F1 is 97.9% versus 73.1% on the test set, again a large gap; test recall is high (85.7%), but test precision is only 63.8%
Stacking Model - Base estimators(Weighted Bagging, AdaBoost Tuned, Random Forest Tuned, DecisionTree Tuned) & Final estimator(XGBoost Tuned)
# training performance comparison
stacking_models_train_comp_df = pd.concat(
[
stacking_classifier_BagT_Grad_DTreeT_RFT_Train.T,
stacking_classifier_AdaT_GradT_DTree_XGBT_Train.T,
stacking_classifier_BagDtree_AdaT_RFT_DTreeT_XGBT_Train.T,
],
axis=1,
)
stacking_models_train_comp_df.columns = [
"BagT_Grad_DTreeT_RFT",
"AdaT_GradT_DTree_XGBT",
"BagDtree_AdaT_RFT_DTreeT_XGBT"
]
# test set performance comparison
stacking_models_test_comp_df = pd.concat(
[
stacking_classifier_BagT_Grad_DTreeT_RFT_Test.T,
stacking_classifier_AdaT_GradT_DTree_XGBT_Test.T,
stacking_classifier_BagDtree_AdaT_RFT_DTreeT_XGBT_Test.T,
],
axis=1,
)
stacking_models_test_comp_df.columns = [
"BagT_Grad_DTreeT_RFT",
"AdaT_GradT_DTree_XGBT",
"BagDtree_AdaT_RFT_DTreeT_XGBT"
]
print("Stacking Technique: Training performance comparison:")
stacking_models_train_comp_df
Stacking Technique: Training performance comparison:
| | BagT_Grad_DTreeT_RFT | AdaT_GradT_DTree_XGBT | BagDtree_AdaT_RFT_DTreeT_XGBT |
|---|---|---|---|
| Accuracy | 0.997890 | 0.959916 | 0.991863 |
| Recall | 0.995320 | 1.000000 | 1.000000 |
| Precision | 0.993769 | 0.828165 | 0.959581 |
| F1 | 0.994544 | 0.906007 | 0.979374 |
print("Stacking Technique: Test set performance comparison:")
stacking_models_test_comp_df
Stacking Technique: Test set performance comparison:
| | BagT_Grad_DTreeT_RFT | AdaT_GradT_DTree_XGBT | BagDtree_AdaT_RFT_DTreeT_XGBT |
|---|---|---|---|
| Accuracy | 0.907238 | 0.853127 | 0.888967 |
| Recall | 0.760956 | 0.808765 | 0.856574 |
| Precision | 0.726236 | 0.557692 | 0.637982 |
| F1 | 0.743191 | 0.660163 | 0.731293 |
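All of the metric tables above come from the notebook's `model_performance_classification_sklearn_with_threshold` helper, which is defined earlier. As a hedged illustration of how such a threshold-aware scorer can be written (the function name, signature, and default threshold below are assumptions, not the notebook's actual definition):

```python
import pandas as pd
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

def performance_with_threshold(model, X, y, threshold=0.5):
    """Score a fitted binary classifier at a custom probability threshold."""
    # Convert positive-class probabilities into labels using the chosen cutoff
    pred = (model.predict_proba(X)[:, 1] >= threshold).astype(int)
    return pd.DataFrame(
        {
            "Accuracy": [accuracy_score(y, pred)],
            "Recall": [recall_score(y, pred)],
            "Precision": [precision_score(y, pred)],
            "F1": [f1_score(y, pred)],
        }
    )
```

Lowering the threshold below 0.5 trades precision for recall, which matters here because reaching likely buyers (recall) is the business priority.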
Stacking Model - Base estimators(DecisionTree Tuned, Bagging Tuned, Gradient) & Final estimator(RandomForest Tuned)
Stacking Model - Base estimators(AdaBoost Tuned, Gradient Tuned, DecisionTree) & Final estimator(XGBoost Tuned)
Stacking Model - Base estimators(Weighted Bagging, DecisionTree Tuned, AdaBoost Tuned, Random Forest Tuned) & Final estimator(XGBoost Tuned)
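The three stacked ensembles compared above all follow the same scikit-learn pattern: base estimators produce cross-validated predictions that a final estimator learns to combine. A minimal self-contained sketch of the pattern on synthetic data (the estimators and data here are illustrative, not the notebook's tuned models):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier, StackingClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the customer data
X, y = make_classification(n_samples=500, n_features=10, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)

# Base estimators' out-of-fold predictions become features for the final estimator
stack = StackingClassifier(
    estimators=[
        ("dtree", DecisionTreeClassifier(max_depth=3, random_state=1)),
        ("ada", AdaBoostClassifier(n_estimators=50, random_state=1)),
    ],
    final_estimator=RandomForestClassifier(n_estimators=50, random_state=1),
)
stack.fit(X_tr, y_tr)
print("Test F1:", round(f1_score(y_te, stack.predict(X_te)), 3))
```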
# training performance comparison for the best models from each technique
comparison_models_train_comp_df = pd.concat(
[
bagging_estimator_tuned_model_train_perf.T,
bagging_estimator_dTree_model_train_perf.T,
rf_tuned_model_train_perf.T,
abc_tuned_model_train_perf.T,
gbc_tuned_model_train_perf.T,
xgb_classifier_model_train_perf.T,
xgb_tuned_model_train_perf.T,
stacking_classifier_BagT_Grad_DTreeT_RFT_Train.T,
stacking_classifier_AdaT_GradT_DTree_XGBT_Train.T,
stacking_classifier_BagDtree_AdaT_RFT_DTreeT_XGBT_Train.T,
],
axis=1,
)
comparison_models_train_comp_df.columns = [
"Bagging Tuned",
"Bagging Weighted DTree",
"Random Forest Tuned",
"AdaBoost Tuned",
"Gradient Tuned",
"XGBoost",
"XGBoost Tuned",
"Stack BagT Grad DTreeT RFT",
"Stack AdaT GradT DTree XGBT",
"Stack BDtree AdaT RFT DTreeT XGBT",
]
# test set performance comparison for the best models from each technique
comparison_models_test_comp_df = pd.concat(
[
bagging_estimator_tuned_model_test_perf.T,
bagging_estimator_dTree_model_test_perf.T,
rf_tuned_model_test_perf.T,
abc_tuned_model_test_perf.T,
gbc_tuned_model_test_perf.T,
xgb_classifier_model_test_perf.T,
xgb_tuned_model_test_perf.T,
stacking_classifier_BagT_Grad_DTreeT_RFT_Test.T,
stacking_classifier_AdaT_GradT_DTree_XGBT_Test.T,
stacking_classifier_BagDtree_AdaT_RFT_DTreeT_XGBT_Test.T,
],
axis=1,
)
comparison_models_test_comp_df.columns = [
"Bagging Tuned",
"Bagging Weighted DTree",
"Random Forest Tuned",
"AdaBoost Tuned",
"Gradient Tuned",
"XGBoost",
"XGBoost Tuned",
"Stack BagT Grad DTreeT RFT",
"Stack AdaT GradT DTree XGBT",
"Stack BDtree AdaT RFT DTreeT XGBT",
]
print(
"Overall Comparison (best models from each technique): Training set performance:"
)
comparison_models_train_comp_df.mul(100).round(decimals=2).astype(str).add("%")
Overall Comparison (best models from each technique): Training set performance:
| | Bagging Tuned | Bagging Weighted DTree | Random Forest Tuned | AdaBoost Tuned | Gradient Tuned | XGBoost | XGBoost Tuned | Stack BagT Grad DTreeT RFT | Stack AdaT GradT DTree XGBT | Stack BDtree AdaT RFT DTreeT XGBT |
|---|---|---|---|---|---|---|---|---|---|---|
| Accuracy | 99.94% | 99.88% | 95.78% | 98.58% | 92.77% | 99.97% | 97.2% | 99.79% | 95.99% | 99.19% |
| Recall | 99.69% | 99.38% | 87.83% | 94.7% | 65.21% | 99.84% | 99.06% | 99.53% | 100.0% | 100.0% |
| Precision | 100.0% | 100.0% | 90.08% | 97.9% | 96.09% | 100.0% | 87.95% | 99.38% | 82.82% | 95.96% |
| F1 | 99.84% | 99.69% | 88.94% | 96.27% | 77.7% | 99.92% | 93.18% | 99.45% | 90.6% | 97.94% |
print("Overall Comparison (best models from each technique): Test set performance:")
comparison_models_test_comp_df.mul(100).round(decimals=2).astype(str).add("%")
Overall Comparison (best models from each technique): Test set performance:
| | Bagging Tuned | Bagging Weighted DTree | Random Forest Tuned | AdaBoost Tuned | Gradient Tuned | XGBoost | XGBoost Tuned | Stack BagT Grad DTreeT RFT | Stack AdaT GradT DTree XGBT | Stack BDtree AdaT RFT DTreeT XGBT |
|---|---|---|---|---|---|---|---|---|---|---|
| Accuracy | 91.78% | 91.29% | 88.83% | 86.02% | 88.55% | 92.34% | 89.46% | 90.72% | 85.31% | 88.9% |
| Recall | 58.96% | 57.37% | 62.15% | 60.96% | 47.01% | 65.74% | 80.88% | 76.1% | 80.88% | 85.66% |
| Precision | 91.36% | 89.44% | 70.91% | 60.24% | 79.73% | 87.77% | 66.56% | 72.62% | 55.77% | 63.8% |
| F1 | 71.67% | 69.9% | 66.24% | 60.59% | 59.15% | 75.17% | 73.02% | 74.32% | 66.02% | 73.13% |
# Calculating the variance (train minus test) of each metric, then filtering to the F1 score
variance_df = comparison_models_train_comp_df - comparison_models_test_comp_df
variance_df[variance_df.index == "F1"].mul(100).round(decimals=2).astype(str).add("%")
| | Bagging Tuned | Bagging Weighted DTree | Random Forest Tuned | AdaBoost Tuned | Gradient Tuned | XGBoost | XGBoost Tuned | Stack BagT Grad DTreeT RFT | Stack AdaT GradT DTree XGBT | Stack BDtree AdaT RFT DTreeT XGBT |
|---|---|---|---|---|---|---|---|---|---|---|
| F1 | 28.17% | 29.78% | 22.7% | 35.68% | 18.55% | 24.75% | 20.16% | 25.14% | 24.58% | 24.81% |
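The train-test F1 gap computed above is a quick overfitting screen. The same idea can be sketched on hypothetical scores (the values and the 20% cutoff below are illustrative assumptions, not the notebook's exact figures):

```python
import pandas as pd

# Hypothetical train/test F1 scores (illustrative, not the notebook's values)
f1_scores = pd.DataFrame(
    {"train_F1": [0.999, 0.932, 0.777], "test_F1": [0.752, 0.730, 0.592]},
    index=["Model A", "Model B", "Model C"],
)
f1_scores["gap"] = f1_scores["train_F1"] - f1_scores["test_F1"]
# Flag models whose train-test F1 gap exceeds a chosen cutoff (assumption: 0.20)
f1_scores["overfit_flag"] = f1_scores["gap"] > 0.20
print(f1_scores)
```

A model with a small gap but a low absolute test F1 (like "Model C" here) generalizes well yet may still be too weak to deploy, which is the trade-off discussed in the analysis below.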
feature_names = XX_train.columns
importances = xgb_tuned.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(12, 12))
plt.title("Feature Importances - XGBoost Tuned")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
feature_names = XX_train.columns
importances = gbc_tuned.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(12, 12))
plt.title("Feature Importances - Gradient Boosting Tuned")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
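The impurity-based `feature_importances_` plotted above can favor high-cardinality or continuous features; permutation importance on held-out data is a common cross-check. A hedged sketch on synthetic data (not run on this notebook's customer dataset):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=400, n_features=6, n_informative=3, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)
model = GradientBoostingClassifier(random_state=1).fit(X_tr, y_tr)

# Mean drop in test score when each feature is shuffled, averaged over repeats
result = permutation_importance(model, X_te, y_te, n_repeats=10, random_state=1)
for i in result.importances_mean.argsort()[::-1]:
    print(f"feature_{i}: {result.importances_mean[i]:.3f}")
```

Because it is computed on held-out data, permutation importance reflects what the model actually uses for generalization, not just for fitting the training set.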
Model Analysis:
Based on the comparison of the top models picked from each technique:
- Considering the F1 score on the training data, the default XGBoost has the highest score at 99.92%, followed by "Bagging Weighted DTree" at 99.69% and the "Stack BagT Grad DTreeT RFT" model at 99.45%. All of these models are overfitting, given the large variance (roughly 25% or more) between their training and testing F1 scores
- "XGBoost Tuned" is the next better-fitting model, with an F1 score of 93.18% on the training data and 73.02% on the testing data, a variance of 20.2%, and a training accuracy of 97.2%. This model does not appear to overfit and predicts well on the testing data
- "Gradient Tuned" generalizes better, with an F1 score of 77.7% on the training data and 59.15% on the testing data, a variance of only 18.6%, and a training accuracy of 92.77%. This model does not appear to overfit, although its absolute F1 scores are lower
Important Features:
- From the "XGBoost Tuned" model analysis, features such as Passport, Designation_Exec, Marital Status Single, Product Pitched Deluxe, Designation SM, Product Pitched Super Deluxe, and City Tier play an important part in identifying potential customers
- From the "Gradient Tuned" model analysis, features such as Monthly Income, Passport, Age, Designation_Exec, DurationOfPitch, Status_Single, Number of Followups, City Tier, and Number of Trips play an important part in identifying potential customers
Observation:
- As the final results depend on the hyperparameter values searched with GridSearchCV, better parameter combinations may exist; further tuning could yield a higher F1 score.
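As the observation notes, results are bounded by the grids actually searched; widening the grid while scoring on F1 is the natural next step. A minimal sketch of that pattern (the estimator and grid values below are illustrative assumptions, not the notebook's actual grids):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=1)

# Illustrative grid; a wider search may surface better F1 than the values tried so far
param_grid = {"max_depth": [3, 5, 7, None], "min_samples_leaf": [1, 5, 10]}
grid = GridSearchCV(
    DecisionTreeClassifier(random_state=1),
    param_grid,
    scoring="f1",  # optimize the same metric used to compare models above
    cv=5,
)
grid.fit(X, y)
print("Best params:", grid.best_params_)
print("Best CV F1: %.3f" % grid.best_score_)
```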
Based on the Customer Information:
Based on the products purchased by the customers, we found the following insights, which can be leveraged as recommendations for understanding the customer base: